The Myhill-Nerode theorem

One problem with the Pumping lemma is that it can be difficult to guess which \({ w }\) you need to take in order to find a contradiction. Another problem is that there are languages with the pumping property which are nonetheless not regular. There is a nice necessary and sufficient condition for regularity but it will require some preliminaries.

From languages to automata

From a language we can directly construct an automaton which recognises it, as described below. This automaton may or may not be finite.

We start with a set of tokens \({ A }\) and a language \({ L }\), i.e. a subset of \({ B }\), the set of all lists of members of \({ A }\). Let \({ ε }\) be the empty list. We define a function \({ f }\) from \({ B }\) to \({ P B }\) by \[ f ( w ) = \{ z ∈ B : w z ∈ L \} . \] We call \({ f ( w ) }\) the set of valid continuations of \({ w }\), since \({ z ∈ f ( w ) }\) if and only if reading \({ z }\) after reading \({ w }\) gives us a member of the language. Note that \({ ε ∈ f ( w ) }\) if and only if \({ w ∈ L }\). Also, \({ f ( ε ) = L }\). Both of these statements follow from the fact that \({ ε }\) is the identity for \({ B }\) with the operation of concatenation. Let \({ C }\) be the range of \({ f }\), i.e. the set of valid continuation sets.

We define an equivalence relation on \({ B }\) by saying that \({ u }\) and \({ v }\) are equivalent whenever \({ f ( u ) = f ( v ) }\). Let \({ E }\) be the set of equivalence classes. Let \({ g }\) be the function from \({ B }\) to \({ E }\) which takes each list to the equivalence class to which it belongs. Every equivalence class is the equivalence class of some list. One way to express this is to say that \({ g }\) is a surjective function, i.e. that if \({ P ∈ E }\) then \({ P = g ( u ) }\) for some \({ u ∈ B }\). If \({ P = g ( v ) }\) then \({ u }\) and \({ v }\) are equivalent, i.e. \({ f ( u ) = f ( v ) }\), so \({ f ( u ) }\) depends only on \({ P }\) and not on the particular \({ u }\) chosen. It is therefore legitimate to define \({ h ( P ) = f ( u ) }\). \({ h }\) is a function from \({ E }\) to \({ C }\). It was defined in such a way that \({ h ( g ( u ) ) = f ( u ) }\) for all \({ u }\), i.e. such that \({ f = h ∘ g }\). \({ h }\) is a injective because if \({ h ( P ) = h ( Q ) }\) then \({ P = g ( u ) }\) and \({ Q = g ( v ) }\) for some \({ u }\) and \({ v }\) in \({ B }\), but then \({ f ( u ) = f ( v ) }\) so \({ u }\) and \({ v }\) are equivalent and so \({ g ( u ) = g ( v ) }\), or in other words \({ P = Q }\). \({ h }\) is surjective since if \({ R ∈ C }\) then \({ R = f ( w ) }\) for some \({ w ∈ B }\) and then \({ R = h ( g ( w ) ) }\) and hence \({ R = g ( P ) }\) for some \({ P ∈ E }\).

Suppose \({ P ∈ E }\) and \({ w ∈ B }\). Let \[ R = \{ z ∈ B : w z ∈ P \} \] Now \({ P = f ( u ) }\) for some \({ u ∈ B }\) so \[ R = \{ z ∈ B : w z ∈ f ( u ) \} \] or \[ R = \{ z ∈ B : u w z ∈ L \} \] and therefore \[ R = f ( u w ) . \] \({ f ( u w ) }\) is a member of \({ C }\) and \({ g }\) is a bijective function from \({ E }\) to \({ C }\) so there is a unique \({ Q ∈ E }\) such that \({ R = g ( Q ) }\), i.e. such that \[ g ( Q ) = \{ z ∈ B : w z ∈ P \} . \] We can therefore define a function \({ t }\) from \({ E × B }\) to \({ E }\) by \[ g ( t ( P , w ) ) = \{ z ∈ B : w z ∈ P \} . \] An alternate way to describe this is that \({ x ∈ t ( P , w ) }\) if and only if there is some \({ u ∈ P }\) such that \({ x = u w }\).

Given any list \({ w = ( a _ 1 , a _ 2 , \ldots , a _ n ) }\) of tokens we can form the list of equivalence classes \({ ( s _ 0 , s _ 1 , s _ 2 , \ldots , s _ n ) }\) where \[ s _ 0 = g ( ε ) , \quad s _ 1 = t ( ( a _ 1 ) , s _ 0 ) , \quad s _ 2 = t ( ( a _ 2 ) , s _ 1 ) , \quad \cdots \quad s _ n = t ( ( a _ n ) , s _ { n - 1 } ) . \] By induction we have \[ g ( ( a _ 1 , a _ 2 , \ldots , a _ j ) ) ∈ s _ j \] for all \({ j }\) and so, in particular \[ g ( w ) ∈ s _ n . \] Then \[ f ( w ) = h ( s _ n ) \] Now \({ w ∈ L }\) if and only if \({ ε ∈ f ( w ) }\), i.e. if and only if \({ ε ∈ h ( s _ n ) }\). Let \[ I = \{ g ( ε ) \} , \] \[ F = \{ s ∈ E : ε ∈ h ( s ) \} , \] and \[ T = \{ ( r , a , s ) ∈ E × A × E : s = t ( s , ( a ) ) \} . \] Then \({ ( s _ 0 ∈ I ) }\), \({ ( s _ j , a _ j , s _ { j + 1 } ) }\) for all \({ j < n }\) and \({ w ∈ L }\) if and only if \({ s _ n ∈ F }\). In other words, if we form the automaton whose state set is \({ E }\), whose initial and accepting sets are \({ I }\) and \({ F }\) respectively and whose transition relation is \({ T }\) then this automaton recognises \({ L }\). In particular if \({ E }\) is finite then we have a finite state automaton which recognises \({ L }\). We’ll see later that the converse is also true, that if there is a finite state automaton which recognises \({ L }\) then \({ E }\) is finite, but first it may be helpful to do an example.

An example

We can use the construction above to find a finite state automaton for the language of integers.

What are the equivalence classes of strings of the characters 0, 1, …, 9, and -?

We always have the equivalence class of the empty string, whose continuation set is just the language. There is no other string whose continuation set has all integers as members so the empty string is the only member of this language.
There is also the equivalence class of the string 0. The only continuation of 0 is the empty string. There are no other strings whose only continuation is the empty string, so 0 is the only member of this equivalence class.
We also have the equivalence class of -. The continuations are just the strings representing positive integers. There is no other string with the same continuation set so - is the only member of this equivalence class.
There is also an equivalence class whose members are all non-zero integers. These all have all strings of digits as their continuations. That includes an empty string of digits.
Finally, there is an equivalence class consisting of those strings with no continuations. These are the strings with some sort of syntax error, like 0--.

These are all the equivalence classes. We’ve just seen that we can form a finite state automaton whose states correspond to the equivalence classes. The only initial state is the one corresponding to the equivalence class of the empty set. The accepting states are the ones for which the empty list is a valid continuation, which in this case is the class of 0 and the class of non-zero integers.

There are two reasonable ways to label these states. One is with the equivalence classes and the other is with the continuations. We saw in the last section that there is a bijective function, which we called \({ h }\) from equivalence classes to continuations, so either will work. I find it easier to understand the version with states labelled by continuations. There is one slightly tricky point. We have one state where the set of continuations is empty and one where the only member of the set of continuations is the empty list. We can’t label both of them empty. I’ll use that label for the second one, and the label error for the first one, since that’s the state we’re in if there has been a syntax error in the input. With these choices the finite state automaton for the integer language is the one with the accompanying diagram.

A strongly deterministic finite state automaton for the integers

This finite state automaton is deterministic, and in fact strongly deterministic. This is not an accident. The construction from the previous section always gives a strongly deterministic automaton, and indeed gives one with as few states as possible.

The converse

We’ve seen that if the set \({ E }\) of equivalence classes is finite then there is a finite state automaton which recognises the language. In this section we’ll see that the converse is also true. If a language is recognised by a finite state automaton then \({ E }\) is finite.

Suppose we have a finite state automaton with \({ S }\) as its set of states which recognises the language \({ L }\). As usual, \({ F }\) will be the set of accepting states. As before I’ll denote the set of all lists of members of \({ A }\) by \({ B }\). For each \({ w }\) in \({ B }\) let \({ i ( w ) }\) be the subset of \({ S }\) consisting of those states which the automaton can be in after reading \({ w }\). There could be more than one member of \({ i ( w ) }\) if the automaton is non-deterministic, and there could be none if it is not strongly deterministic. If it is strongly deterministic then \({ i ( w ) }\) has exactly one member. Let \({ D }\) be the range of \({ i }\).

If \({ z }\) is a continuation of \({ w }\) then \({ w z }\) is a member of the language and so must be accepted by the automaton, so there must be a computational path for \({ z`}\) from a member of \({ i ( w ) }\) to a member of \({ F }\). Conversely, if there is such a path then \({ w z }\) must be a member of the language, so \({ z }\) is a continuation of \({ w }\). In particular the set \({ f ( w ) }\) of continuations of \({ w }\) depends only on \({ i ( w ) }\). So if we define \({ j }\) to be the function from the range of \({ i }\) in \({ P S }\) to \({ C }\) which takes a subset \({ H ⊆ S }\) to the set of strings which can be accepted by the automaton from some state of \({ H }\) then \({ f = j ∘ i }\). Since we already have \({ f = h ∘ g }\) we have \({ j ∘ i = h ∘ g }\). Let \({ k = j ∘ h ^ { - 1 } }\). This makes sense since \({ h }\) was already shown to be bijective. Then \({ k ∘ i = g }\). \({ k }\) is a surjective from the range of \({ i }\), which is a subset of \({ P S }\), to \({ E }\). \({ S }\) is finite, so \({ P S }\) is finite, so the range of \({ i }\) is finite, and therefore \({ E }\) is finite.

In fact we can be more precise. If there are \({ n }\) states then there are \({ 2 ^ n }\) members of \({ P S }\) and so at most \({ 2 ^ n }\) members of the range of \({ i }\) and then at most \({ 2 ^ n }\) members of \({ E }\). If the finite state automaton is strongly deterministic then every member of the range of \({ i }\) has a single member and there are only \({ n }\) such subsets of \({ S }\), so we get the much stronger result that \({ E }\) has at most \({ n }\) members. In particular every strongly deterministic finite state automaton which recognises \({ L }\) has at least as many states as \({ E }\) has members. In an earlier section we constructed a strongly deterministic finite state automaton with exactly that many members. We can now see that that automaton is minimal, in the sense that it has the smallest possible number of states for a strongly deterministic automaton which recognises \({ L }\).

We can now state one form of the Myhill-Nerode theorem, that if \({ L }\) is a language and \({ E }\) is the set of equivalence classes of lists with respect to \({ L }\), equivalence being defined by saying that lists are equivalent if they have the same set of continuations, then \({ L }\) is regular if and only if \({ E }\) is finite.