Regular Expressions and Divisibility

A few years ago, I took a class on computational theory concurrently with a class on abstract algebra and I noticed a few interesting things:

Divisibility rules are a simple application of modular arithmetic.
Such modular arithmetic operations can be implemented in deterministic finite-state automata (DFAs).
DFAs and regular expressions have equivalent computational power.

Therefore, any divisibility rule can be written as a regular expression. In fact, there's a straightforward algorithm for doing this for any divisor and in any base system.

Divisibility

Say there is a natural number $m \in ℕ$ . When written as a series of $n$ base- $b$ digits $σ_{1}, σ_{2}, \dots, σ_{n}$ , $m$ is expressed as a polynomial as shown in Equation 1:

m = σ_{1} b^{n - 1} + σ_{2} b^{n - 2} + \dots + σ_{n}

Equation 1:

m

written as a base-

b

polynomial.

Using Horner's method, this can be re-written into the series $q_{1}, q_{2}, \dots, q_{n}$ shown in Equation 2:

q_{k} = {\begin{matrix} σ_{1} & : & k = 1 \\ q_{k - 1} b + σ_{k} & : & k > 1 \end{matrix}

Equation 2: The series

q_{k}

, derived from Equation 1 using Horner's method.

Writing the series as in Equation 2 makes it possible to process the digits in order, with $q_{n} = m$ once the series is fully evaluated.

If $m$ is divisible by $d$ , it will be in its congruence class $0$ (i.e. $m mod d = 0$ ). To test this using modular arithmetic, there needs to be a way to calculate $q_{k} mod d$ from $q_{k - 1} mod d$ . Equation 3 shows how this can be done using some properties of modular arithmetic:

\begin{array}{l} q_{k} mod d = q_{k - 1} b + σ_{k} mod d \\ = (q_{k - 1} b mod d) + (σ_{k} mod d) \\ = (q_{k - 1} mod d) (b mod d) + (σ_{k} mod d) \\ = ((q_{k - 1} mod d) b mod d) + (σ_{k} mod d) \\ = (q_{k - 1} mod d) b + σ_{k} mod d \end{array}

Equation 3: Derivation of a modular form of Equation 2 for

k > 1

This gives Equation 4 as a modular form of Equation 2:

q_{k} mod d = {\begin{matrix} σ_{1} mod d & : & k = 1 \\ (q_{k - 1} mod d) b + σ_{k} mod d & : & k > 1 \end{matrix}

Equation 4: A modular version of series

q_{k}

defined in Equation 2

DFAs

A Deterministic Finite-state Automaton (DFA) is a type of basic computing machine that accepts or rejects a string of input symbols by reading them one at a time and advancing its internal state for each input symbol using a transition function to one of a finite set of possible internal states. The divisibility problem is especially well-suited for this type of computing model—there's a string of input symbols (the digits of the number), a finite set of internal states (the congruence classes), a transition function (Equation 4), and simple acceptance/rejection criteria (numbers in congruence class $0$ ). Putting this together gives the DFA definition in Equation 5:

\begin{array}{crcl} ( \\ Q & = & {q \in ℕ : 0 \leq q \leq d}, \\ Σ & = & {σ \in ℕ : 0 \leq σ < b}, \\ δ (q, σ) & = & {\begin{matrix} q b + σ mod d & : & q < d \\ σ mod d & : & q = d \end{matrix}, \\ q_{0} & = & d, \\ F & = & {0} \\ ) \end{array}

Equation 5: DFA definition for testing divisibility

In Equation 5,

The set of states $Q$ is the set of congruence classes for $d$ plus an extra state $d$ added as the DFA's initial state.
The input alphabet $Σ$ is the set of base- $b$ digits.
The transition function $δ (q, σ)$ is defined based on Equation 4, with $q = d$ corresponding to the initial $k = 1$ case. This case is only separated for clarity since it's mathematically equivalent to the $q < d$ case.
The initial state $q_{0}$ is the extra initial state $d$ .
The set of accepting states $F$ is just the congruence class $0$ , which $m$ must be part of to be divisible by $d$ .

DFAs are often shown as graphs with the states as nodes and the input symbols as edges. Figure 1 shows a diagram of a DFA generated from Equation 5 with $b = 4$ and $d = 3$ :

Graph with 4 nodes and 12 edges illustrating DFA defined in Equation 5 for b=4 and d=3 — Figure 1: DFA generated for recognizing quaternary numbers divisible by 3.

Regular Expressions

Regular expressions are a ubiquitous tool for text searching, and have equivalent computational power to DFAs. As part of this equivalence, it's possible to construct a regular expression from any DFA by way of Kleene's algorithm. Kleene's algorithm makes use of Generalized Non-deterministic Finite-state Automata (GNFAs) as an intermediary between DFAs and regular expressions. GNFAs use regular expressions on their transition edges rather plain input symbols; Figure 2 shows the GNFA version of the DFA in Figure 1:

Graph from Figure 2 with edges in regular expression notation — Figure 2: GNFA for recognizing quaternary numbers divisible by 3.

Kleene's algorithm is similar to the Floyd-Warshall algorithm and works by eliminating nodes in the GNFA one-by-one until only one edge with the final regular expression remains. This is done by incoporating paths through each eliminated node into the regular expressions between every other pair of nodes; Figure 3 shows this process with the GNFA in Figure 2:

Four graphs with decreasing node/edge count, with the final graph only having one edge marked with a long regular expression — Figure 3: Application of Kleene's algorithm to find a regular expression for quaternary numbers divisible by 3. State $1$ is eliminated first (top right) followed by state $2$ (bottom left) and then finally the self-edge on state $0$ (bottom right).

Code

Constructing the DFA and then applying Kleene's algorithm to get a regular expression is pretty straightforward, but also pretty tedious to do by hand. Luckily, all of the math translates pretty directly to code. The first step is to define the input symbols, as shown in Listing 1:

static char digit(unsigned d) {
	if (d < 10) return d + '0';
	if (d < 36) return d + 'a' - 10;
	return d + 'A' - 36;
}

Listing 1: digit function defining input symbols.

Next, the transition function $δ (q, σ)$ is defined as in Listing 2:

static unsigned delta(unsigned q, unsigned sigma, unsigned d, unsigned b) {
	return q < d ? (q*b + sigma)%d : sigma%d;
}

Listing 2: delta function implementing the DFA's transition function.

This is then used to construct the GNFA as in Listing 3:

static string inputs_across(unsigned q1, unsigned q2, unsigned d, unsigned b) {
	string reg;
	for (unsigned sigma = 0; sigma < b; ++sigma)
			if (delta(q1, sigma, d, b) == q2) {
		if (not reg.empty()) reg.push_back('|');
		reg.push_back(digit(sigma));
	}
	return reg;
}

// ...

for (unsigned q1 = 0; q1 < d + 1; ++q1)
		for (unsigned q2 = 0; q2 < d + 1; ++q2) {
	gnfa.push_back(inputs_across(q1, q2, d, b));
}

Listing 3: Initialization of the GNFA adjacency matrix gnfa.

With the GNFA constructed, the next step is to reduce it using Kleene's algorithm as shown in Listing 4:

for (unsigned qr = 1; qr < d; ++qr)
	for (unsigned q1 = qr + 1; q1 != 1; q1 = (q1 + 1)%(d + 1))
		for (unsigned q2 = qr + 1; q2 != 1;
			q2 = (q2 + 1)%(d + 1)) {
	
	//grab a reference to the current entry from q1 to q2
	string& across {gnfa[q1*(d + 1) + q2]};
	
	//and those along the path from q1 to qr to q2
	const string& to {gnfa[q1*(d + 1) + qr]},
		&loop {gnfa[qr*(d + 1) + qr]},
		&from {gnfa[qr*(d + 1) + q2]};
	
	//add a union symbol if needed
	if (not across.empty() and not to.empty()
			and not from.empty())
		across.push_back('|');
	
	//reduce the path
	if (not to.empty() and not from.empty()) {
		across += paren_for_concat(to);
		if (not loop.empty()) across += star(loop);
		across += paren_for_concat(from);
	}
}

Listing 4: Applying Kleene's algorithm to gnfa.

Finally, only the start and accept states remain, and the edge between them is combined with the accept state's self-edge to form the final regular expression as in Listing 5:

return paren_for_concat(gnfa[d*(d + 1) + 0])
	+ star(gnfa[0*(d + 1) + 0]);

Listing 5: Combining the final two edges in gnfa to form the final regular expression.

The full code is available in divregex.cpp; it's very simple and under 100 lines in total, though the output can get pretty long since nothing is done to optimize the generated regular expressions.

Usage

The program takes two arguments: the first is the divisor and the second is the base (which is optional: the default is decimal). Listing 6 shows how to compile the program and use it to generate the regular expression found in Figure 3:

$ g++ divregex.cpp -o divregex
$ ./divregex 3 4
\b(0|3|1(0|3)*2|(2|1(0|3)*1)(0|3|2(0|3)*1)*(1|2(0|3)*2))(0|3|1(0|3)*2|(2|1(0|3)*1)(0|3|2(0|3)*1)*(1|2(0|3)*2))*\b

Listing 6: Compilation and invocation of the divregex program to generate a regular expression matching quaternary numbers divisible by 3.

Since the output is written to stdout, it is very easy to use it with other programs as shown in Listing 7:

$ egrep `./divregex 7`
12 13 14 15
12 13 14 15
5541325 4849789
5541325 4849789
123456789
123456788
123456788
7
7

Listing 7: Usage of divregex along with grep to match decimal numbers divisible by 7.

The divregex program can also be compiled with Emscripten to run in web browsers, as shown in Listing 8:

$ emcc -s ENVIRONMENT=web -s INCOMING_MODULE_JS_API='[print]' -s EXPORTED_RUNTIME_METHODS='[callMain]' -s MODULARIZE -s EXPORT_ES6 -s NO_DYNAMIC_EXECUTION -s EMIT_EMSCRIPTEN_LICENSE -s NO_FILESYSTEM -s NO_INVOKE_RUN -s ALLOW_MEMORY_GROWTH -Os divregex.cpp -o divregex.js

Listing 8: Compilation of divregex for the web context using Emscripten.

The compilation results from Listing 8 power the live demo in the next section:

Live Demo

Regular expression matching numbers divisible by in base :

…But, Why?

Why not? Sure, grep probably still isn't the best tool to use when trying to figure out whether numbers are multiples of a particular number, but the fact that it can be used for this at all is an interesting application of computational theory. It also demonstrates how surprisingly powerful regular expressions can be in some cases, especially given some of the more “obvious” things that they can't be used for.