Pumping lemma

Just as there is a pumping lemma for regular languages, there is one for context free languages. The statement is of a similar kind, but more complicated.

For every context free language \({ L }\) there is a natural number \({ p }\) such that every member \({ w }\) of \({ L }\) of length at least \({ p }\) can be written in the form \({ w = a b c d e }\) where the lists \({ a }\), \({ b }\), \({ c }\), \({ d }\) and \({ e }\) have the following properties: the list \({ b c d }\) is of length at most \({ p }\), the lists \({ b }\) and \({ d }\) are not both empty, and \({ a b ^ n c d ^ n e }\) is a member of \({ L }\) for every natural number \({ n }\).
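Stated symbolically, writing \({ | u | }\) for the length of a list \({ u }\), a notation not used elsewhere in this section, the conclusion is that \({ w }\) can be written as

\[ w = a b c d e \quad \text{with} \quad | b c d | \le p , \quad | b | + | d | \ge 1 , \quad \text{and} \quad a b ^ n c d ^ n e \in L \ \text{for every natural number} \ n . \]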

Application

Consider the language consisting of strings made up of some positive number of x’s, followed by the same number of y’s, followed by the same number of z’s. We met this language earlier as the intersection of two context free languages. If this language is context free then there is a number \({ p }\) as in the statement of the lemma. Let \({ w }\) be the string with \({ p }\) x’s, followed by \({ p }\) y’s, followed by \({ p }\) z’s. This is a member of the language and is of length at least \({ p }\), so there should be strings \({ a }\), \({ b }\), \({ c }\), \({ d }\) and \({ e }\) satisfying the conditions listed in the statement of the lemma. The substring \({ b c d }\) is of length at most \({ p }\), and the x’s and z’s in \({ w }\) are separated by \({ p }\) y’s, so \({ b c d }\) could contain x’s or z’s but not both. If we take \({ n = 0 }\) we get the string \({ a c e }\), i.e. \({ a b c d e }\) with \({ b }\) and \({ d }\) removed. If \({ b c d }\) had no x’s then \({ a c e }\) has \({ p }\) x’s and, since \({ b }\) and \({ d }\) are not both empty, is of length less than \({ 3 p }\). If \({ b c d }\) had no z’s then \({ a c e }\) has \({ p }\) z’s and is likewise of length less than \({ 3 p }\). There are no members of the language satisfying either of those conditions though, since a member with \({ p }\) x’s, or with \({ p }\) z’s, has length exactly \({ 3 p }\). This contradicts the statement of the lemma, so our assumption that the language is context free must have been false. In particular, we now have an example of two context free languages whose intersection is not context free.
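To spell out the counting in the first of those two cases, again writing \({ | u | }\) for the length of a string \({ u }\): if \({ b c d }\) contains no x’s then every x of \({ w }\) lies in \({ a }\), \({ c }\) or \({ e }\) and so survives the removal of \({ b }\) and \({ d }\), giving

\[ a c e \ \text{contains} \ p \ \text{x's} \qquad \text{while} \qquad | a c e | = 3 p - | b | - | d | < 3 p , \]

whereas a member of the language containing \({ p }\) x’s must contain \({ p }\) y’s and \({ p }\) z’s as well and so has length exactly \({ 3 p }\). The case where \({ b c d }\) contains no z’s is the same with the roles of x and z exchanged.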

Proof

By assumption our language is context free so it has a phrase structure grammar of the type described previously. Let \({ s }\) be the number of non-terminal symbols in the grammar and \({ r }\) be the maximum number of symbols appearing on the right hand side of any rule.
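As an illustration of these two quantities, using a small grammar chosen just for this purpose rather than one taken from the text: the grammar with the single non-terminal symbol \({ S }\) and the rules

\[ S \to x S y \qquad \text{and} \qquad S \to x y , \]

which generates the lists consisting of some positive number of x’s followed by the same number of y’s, has \({ s = 1 }\) and \({ r = 3 }\).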

In any parse tree for a member of the language the number of leaves among the descendants of a node is at most \({ r ^ h }\), where \({ h }\) is the maximum of the lengths of the branches starting from that node. The length of a branch here is the number of edges in the path from the node to the leaf, which is one less than the number of nodes along that branch.
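This bound can be checked by induction on \({ h }\). A node has at most \({ r }\) children, one for each symbol on the right hand side of the rule applied at it, and the branches starting from each child have length at most \({ h - 1 }\), so

\[ \text{(number of leaves below the node)} \le r \cdot r ^ { h - 1 } = r ^ h , \]

with the base case \({ h = 0 }\) being a node with no children, which is a single leaf on its own.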

Let \({ p = r ^ { s + 1 } }\) and let \({ w }\) be a member of \({ L }\) of length at least \({ p }\). The parse tree for \({ w }\) must then have a branch of length at least \({ s + 1 }\) from the root node. I’ve written “the” parse tree but there could be more than one if the grammar is ambiguous. What the argument above really shows is that every parse tree for \({ w }\) has such a branch. If there’s more than one parse tree we’ll choose one with as few nodes as possible.
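The inequality behind this claim, with \({ h }\) the maximum of the lengths of the branches starting from the root, is

\[ r ^ h \ge \text{(number of leaves)} \ge \text{(length of } w \text{)} \ge p = r ^ { s + 1 } , \]

which forces \({ h \ge s + 1 }\) as long as \({ r \ge 2 }\). If \({ r \le 1 }\) then every member of the language has length at most one, so the lemma holds for such a grammar with \({ p = 2 }\) for trivial reasons, and we may assume \({ r \ge 2 }\) here.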

On the parse tree for \({ w }\) choose a branch of maximal length. From what we said above this length is at least \({ s + 1 }\). The number of symbols appearing on this branch is one more than the length and so is at least \({ s + 2 }\). If we look at the last \({ s + 2 }\) symbols on the branch then one of them must be repeated, since at most one of them, the leaf at the end, can be terminal and there are only \({ s }\) non-terminal symbols. Any such repeated symbol is non-terminal, because terminal symbols don’t have children in the parse tree. Choose one, and choose two occurrences of it among those last \({ s + 2 }\) symbols. We’ll call the one closer to the root the outer occurrence and the one farther from the root the inner occurrence. Let \({ c }\) be the expansion of the inner occurrence and let \({ f }\) be the expansion of the outer occurrence. Let \({ a }\) be the part of \({ w }\) before \({ f }\) and let \({ e }\) be the part after \({ f }\), so that \({ w = a f e }\). Let \({ b }\) be the part of \({ f }\) before \({ c }\) and let \({ d }\) be the part after \({ c }\), so that \({ f = b c d }\).
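In the usual arrow notation for derivations, which the text above doesn’t use, and writing \({ S }\) for the start symbol and \({ A }\) for the repeated non-terminal symbol, the parse tree gives

\[ S \Rightarrow ^ { * } a A e , \qquad A \Rightarrow ^ { * } b A d , \qquad A \Rightarrow ^ { * } c , \]

where the first derivation corresponds to the part of the tree above the outer occurrence, the second to the part between the outer and inner occurrences, and the third to the part below the inner occurrence.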

Now \({ w = a b c d e }\). The list \({ f }\), i.e. \({ b c d }\), is of length at most \({ p }\): since our branch has maximal length, no branch starting from the outer occurrence can be longer than the part of our branch below it, which has length at most \({ s + 1 }\), so by the bound above the number of leaves below the outer occurrence is at most \({ r ^ { s + 1 } = p }\). Taking the parse tree for \({ w }\) and replacing the part of the tree descending from the outer occurrence with the part of the tree descending from the inner occurrence has the effect of replacing \({ f }\) by \({ c }\) in \({ a f e }\), and so gives a parse tree for \({ a c e }\), which must therefore also be a member of the language. This parse tree has fewer nodes than the parse tree we chose for \({ w }\), and we chose one with as few nodes as possible, so \({ a c e }\) is not \({ w }\). In other words, \({ b }\) and \({ d }\) are not both empty. Also, \({ a c e }\) is \({ a b ^ 0 c d ^ 0 e }\), so \({ a b ^ 0 c d ^ 0 e }\) is a member of the language. We could instead replace the part of the tree descending from the inner occurrence with the part descending from the outer occurrence, to get a parse tree for \({ a b f d e }\), i.e. for \({ a b ^ 2 c d ^ 2 e }\), which must therefore also be a member of the language. This construction is repeatable: we can replace the new copy of \({ c }\) by an \({ f }\) to get \({ a b ^ 3 c d ^ 3 e }\), and so on. In this way we see that \({ a b ^ n c d ^ n e }\) is a member of the language for all natural numbers \({ n }\).
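In the same derivation notation as above, with \({ S }\) the start symbol, \({ A }\) the repeated non-terminal and \({ \Rightarrow ^ { * } }\) meaning derivation in some number of steps, the repeated substitution just described amounts to the chain of derivations

\[ S \Rightarrow ^ { * } a A e \Rightarrow ^ { * } a b A d e \Rightarrow ^ { * } a b ^ 2 A d ^ 2 e \Rightarrow ^ { * } \cdots \Rightarrow ^ { * } a b ^ n A d ^ n e \Rightarrow ^ { * } a b ^ n c d ^ n e , \]

in which \({ A \Rightarrow ^ { * } b A d }\) is used \({ n }\) times and \({ A \Rightarrow ^ { * } c }\) is used once at the end.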