Linguistic examples

Generating elements of a language can be considered as a non-deterministic computation. Specifically, we can treat the problem of recognising whether a given list of tokens is an element of the language in the following way.

States correspond to lists of symbols or tokens.
The initial state is a list with just the start symbol.
The allowed actions are replacing a terminal symbol from the list by a token, or a non-terminal by one of its alternates.
Successful termination happens when the list matches the input.
Unsuccessful termination is when the list consists only of tokens, and so no further actions are possible, but the list does not match the input.

As usual, an unsuccessful termination just means that the particular choices of actions in this computational path did not succeed. Some other set of choices might.

As with the puzzle example, it’s possible, and useful, to restrict the allowed actions in such a way that if the original problem is solvable then so is the restricted one. In this case we can do this by insisting that expansion is only applied to the first symbol in the list. Using the last would work equally well, but we’ll use the first.

As an example, consider the following grammar for the language of balanced parentheses.

ok : | "(" ok ")" ok

and the input “(()())”. This has balanced parentheses and so should be recognised. The full tree is infinite, so we’ll have to examine it in parts.

The first figure shows the possible states after four steps, and the computational paths which lead to each one. In this case there are only two possible expansions of ok, the only nonterminal symbol, so we have a binary tree, i.e. one where each node has at most two children. There are two cases where there are no children, the strings “” and “()”, which represent unsuccessful terminations, since neither of them is our input string. We could continue, but the tree is growing exponentially and some pruning is advisable. We can notice, for example, that proceeding from the root and following the right branch and then the left leads us to “()ok”. This computation is doomed. No matter what choices we make from this point on we can only get strings starting with “()”, which can never match the given input “(()())”. So we can ignore that branch. We can, of course, also ignore the original left branch, which terminated unsuccessfully immediately. So the only computational paths which are potentially relevant are those which start by going right, and then right again. If we go right one more time though then we reach “(((ok)ok)ok)ok”. Any possible continuation from here will lead to a string starting with “(((”, which also cannot match our input. So we need only consider the part of the tree where we choose right, right and left initially, leading to “(()ok)ok)”. The next diagram shows the part of the tree starting from that node.

In this new diagram we see another unsuccessful termination, for the branch ending in “(())”. If fact the whole left branch is doomed, since it can only generate strings starting with “(())”, as is the the right subbranch of the right branch, since it can only generate strings starting with “(()((”. We can therefore restrict our attention to the part of the tree from the left subbranch of the right branch, starting from the state “(()()ok)ok”, which is shown in the next diagram.

On the far left of that diagram we see the string “(()())”, which matches our input string, so this computational path succeeds and the string is recognised.

A single successful computational path is enough so we don’t need to worry about what would happen to any of the computations which have not terminated, although it turns out none of them would be successful.

I’ve described this as a recogniser, but it’s not too difficult to turn it into a parser. The only change which is needed is to keep track of the computational path which led us to each state, since this contains the information about how each symbol was expanded, which is all we need in order to construct a parse tree.

What would have happened if the input string was invalid, i.e. did not have balanced parentheses? This depends on whether we prune the tree systematically as in the example above. If we do then eventually all computational paths will terminate unsuccessfully. If we don’t then the computation will simply run forever. For valid inputs pruning makes the algorithm more efficient, and certainly easier to produce diagrams for, but isn’t essential for it to recognise the input as valid.

In our example the tree was a binary tree. This happened for two reasons. First, all of our non-terminal symbols, of which in fact there was only one, had more than two alternates. Second, none of our nonterminal symbols, which were “(” and “)” had more than one token belonging to it. In fact they each had only one, but we would still have got a binary tree even if they’d had two.

If we consider other grammars the number of alternates for a non-terminal may be larger than two, but it will always be finite. The number of tokens belonging to a symbol could be infinite though, in which case we will have nodes with infinitely many children. We’ll see how to deal with this later.