Pumping lemma

One consequence of the equivalence discussed in the preceding section is that we have six different ways to show that a language is regular:

give a left regular grammar which generates its members
give a right regular grammar which generates its members
give a finite state automaton which recognises its members
give a deterministic finite state automaton which recognises its members
give a generalised finite state automaton which recognises its members
give a regular expression which matches its members

That’s nice, but we’ve seen zero ways so far of showing that a languages isn’t regular. That’s a rather serious gap and none of the descriptions we have are of much help here. We can hardly list all left regular grammars, for example, and check that none of them generate the language. In order to prove that a language isn’t regular you need to identify a property which all regular languages share and then show that this language does not have that property. There are two properties which people use for this.

One of these is the subject of the Myhill-Nerode theorem, which we’ll discuss in the next section. The other is that of the Pumping lemma, which I’ll discuss in this section. The Myhill-Nerode theorem is better in nearly all respects than the Pumping lemma. It provides a necessary and sufficient condition for regularity while the Pumping lemma just provides a necessary condition and, although this is necessarily somewhat subjective, I find it much easier to use. There are two reasons to introduce the Pumping lemma anyway though. One is that it’s more popular. If you ever see someone outside of this module proving a language is not regular they will probably be doing so using the Pumping lemma so you should know what it is. The second reason is that there are two Pumping lemmas, one for regular languages and one for context free languages. It’s the first of these that we’re discussing in this chapter. The second will be discussed in the next chapter. The Myhill-Nerode theorem doesn’t have such a nice generalisation to context free languages so we will need the second version of the Pumping lemma in order to prove that various languages are not context free in the next chapter. For this reason not only will I give a proof of the Pumping lemma for regular languages but I’ll give one which, unlike the usual proof, generalises well to context free languages.

I haven’t introduced a notation for concatenation yet. We’ll be encountering a lot of them in the remainder of this chapter and it’s convenient to have a notation for them. I’ll take the simplest option and denote concatenation by concatenation. In other words, if \({ u }\) and \({ v }\) are lists of tokens then \({ u v }\) will be the list obtained by concatenating \({ u }\) and \({ v }\), in that order. I’ll also write \({ u ^ n }\) for concatenation of \({ n }\) copies of \({ u }\).

The statement of the lemma

With those preliminaries out of the way we can proceed to the statement of the Pumping lemma, which is a bit weird. This will require some terminology. We say that a natural number \({ p }\) is a pumping length for a language \({ L }\) is for every member \({ w }\) of \({ L }\) of length at least \({ p }\) can write \({ w }\) as a concatenation of three lists \({ a }\), \({ b }\) and \({ c }\), in that order, i.e. \({ w = a b c }\), with the following properties:

\({ b }\) is not the empty list,
\({ a b }\) has length at most \({ p }\), and
for every natural number \({ n }\) the list \({ a b ^ n c }\) is a member of \({ L }\).

Note that \({ p }\) depends on \({ L }\) but not on \({ w }\). The pumping length is a property of the language, not any particular member.

A language is said to have the pumping property if it has a pumping length. The Pumping lemma says that every regular language has the pumping property. It does not say that every language with the pumping property is regular, and indeed that’s not true.

There are really two pumping lemmas for regular languages, a left pumping lemma and a right pumping lemma. For some reason everyone seems to state the version above, but there’s also a version which is identical except that it’s \({ b c }\) which has length at most \({ p }\).

An example

Before giving a proof I’ll do an example to show how the Pumping lemma can but used to show that a language isn’t regular.

Earlier we considered a language whose members are all strings of the form some number of x’s followed by some number of y’s. This was a regular language. We know this because we’ve seen a left regular grammar for it, a right regular grammar for it, a deterministic finite state automaton for it, and a regular expression for it. Any one of these would suffice to prove that it is regular. We therefore can’t expect to use the Pumping lemma to show that it’s not regular. Consider, though, the language of strings which are some number of x’s followed by the same number of y’s. We can show, using the Pumping Lemma, that this language is not regular.

The proof is by contradiction. Suppose the language is regular. Then it has the pumping property, i.e. there is some pumping length \({ p }\) for this language. Let \({ w }\) be the string with \({ p }\) x’s followed by \({ p }\) y’s. It belongs to the language so it can be written as \({ a b c }\) as in the definition of the pumping length. Because \({ a b }\) is of length at most \({ p }\) and occurs at the beginning of \({ w }\) both \({ a }\) and \({ b }\) must be strings of x’s. Consider the string \({ a c }\), thought of as \({ a b ^ 0 c }\) This is the case \({ n = 0 }\) of \({ a b ^ n c }\). and so is a member of \({ L }\). \({ b }\) is of positive length and consists solely of x’s so by removing it we now have a string in the language with fewer than \({ p }\) x’s followed by \({ p }\) y’s. But the language is the language of strings where some number of x’s is followed by the same number of y’s, so this is impossible.

The name of the Pumping lemma comes from the possibility of taking \({ n }\) to be large, generating arbitrarily long language elements by a process of “pumping”. We could have done that here but we didn’t need to. Instead of lengthening our string to get a contradiction we shortened it.

This language, which we’ve just shown not to be regular, is a sublanguage of our original xy language, which we already knew to be regular. It follows that not every sublanguage of a regular language is regular.

Finite languages

Students occasionally get confused by one point about the Pumping lemma. Finite languages are always regular. The Pumping lemma appears to allow us to generate arbitrarily long strings in a regular language. How is this not a contradiction? The resolution of this seeming paradox is that the Pumping lemma only says something about sufficiently long strings, specifically those greater than the pumping length of the language. A finite language has a pumping length equal to the length of its longest string. There are no strings \({ w }\) in the language with length longer than that so the statement about being able to split \({ w }\) into \({ a }\), \({ b }\) and \({ c }\) with the given properties doesn’t actually apply to any string and is vacuously true.

The proof of the lemma

Suppose \({ L }\) is a regular language. It must then have a left regular grammar. Let \({ p }\) be the number of non-terminal symbols in this grammar. I will show that \({ p }\) is a pumping length for \({ L }\). Suppose \({ w ∈ L }\) is of length \({ m }\), which we assume is at least \({ p }\). The parse tree for \({ w }\) is of a particularly simple form. The root has one child. Almost every other node has two children, one of which is a leaf with a terminal symbol and the other of which is a non-terminal symbol, also with two children. The one exception is that the non-terminal symbol which gets expanded to the empty list has no children and is therefore also a leaf.

Our parse tree has \({ m }\) leaves labelled by terminal symbols, each with a distinct parent, labelled by a non-terminal symbol other than the start symbol, and the one non-terminal symbol which is a leaf, for \({ m + 1 }\) non-terminals other than the start symbol. Since \({ m + 1 }\) is greater than \({ p }\) some non-terminal is repeated symbol. There may well be more than but there must be one within the last \({ p + 1 }\) symbols. We can take the segment of the tree between those symbols, including the one closest to the root and excluding the one farthest from the root, and repeat this as many times as we want to get parse trees for valid lists in the language. The part of the tree below the repeated segment is our \({ a }\), the repeated part is \({ b }\) and the part above is \({ c }\).

The accompanying diagrams illustrate this construction on the string 2023 in the integer language. The string 02 between 2 and 3 can be repeated arbitrarily many times. The parse tree for the original string is shown along with the trees for zero repetitions and two repetitions.

To get the version of the Pumping lemma where it’s \({ b c }\) which is of length at most \({ p }\) we would apply the same construction to a right regular grammar.