Constructing a parser from a grammar

Can we construct a parser from a grammar description of the type we’ve just described? Yes. We can even do so in a way which is reasonably efficient. Unfortunately that way is also very complicated to describe. If we’re willing to sacrifice efficiency can we do it in a way which is relatively simple to describe? Yes, but there is one way in which this parser will be unsatisfactory.

Recognisers are easier to construct and understand than parsers so we’ll start with a recogniser.

It’s helpful to think in terms of nondeterministic computation. Normally we expect an algorithm to tell us what to do at each stage. Our grammar rules are like an algorithm in that we proceed by steps from a well defined initial state, one where we have a list consisting of just the start symbol. Each step takes the first symbol from the list and replaces it with one of the tokens corresponding to that symbol or expands it into one of the lists of symbols on the right hand side of a grammar rule for that symbol, depending on whether it’s terminal or nonterminal. If we expanded a terminal to a token we check whether that token matches the next input token. If it does then we remove it and if it doesn’t, either because there are no input tokens left or because the next one is different from the one we chose, then the computation terminates unsuccessfully. If we ever reach a point where there are no symbols left to process we check whether there are any input tokens left. If there are then the computation terminates unsuccessfully. If not then it terminates successfully.

Successful termination in the description above means that the input is recognised as valid, i.e. that it can be generated by the given grammar. Unsuccessful termination doesn’t mean that the input can’t be generated though, merely that it wasn’t generated by the particular choices we made. There might be other choices which would generate it.

This is a recogniser rather than a parser but it’s not too difficult to convert it into a parser. Each step in the algorithm processes a symbol and we just need to put this symbol in the correct position in the parse tree, which is the root in the case of the start symbol or as the child of whichever symbol we expanded to get it in the case of all the other symbols.

There’s a trick to turn nondeterministic computations into deterministic ones. Instead of making any particular choice at each step we make all of them. More precisely, starting from the initial state we write down all states we can reach in a single step. Then we write down all states which can be reached from one of those states, also recording the path that led us to those states. Then we write down all the ones we could reach in a single step from those, again recording the path that led to each one. Whenever we write down a state we check whether it satisfies the terminating condition. If so then we check whether the computation terminated successfully or unsuccessfully. If it terminated successfully then we’re done. We have the full path which led us to that state. We’re in the same situation we would be in if we had a “lucky guesser”, who made the optimal choice at each stage, except that it will have taken us longer to get there. If we’re in one of the unsuccessful terminating states then we don’t need to, and indeed can’t, continue looking for continuations of that computational path but we can consider continuations from the other states on our list, if there are any. Only if all of our paths reach a dead end does the computation terminate unsuccessful. Typically it doesn’t terminate at all though.

This algorithm can conveniently by represented by a tree, with the initial state at the root and nodes for each possible computational path and arrows from each of those to its one-step continuations, which implies branching at each node where there are multiple choices for the next step. Unless we specify an upper bound on the number of steps this tree could well be infinite. It is for the parsing problem we just considered, which is why I won’t attempt to draw the tree.

Does this work? That depends on whether the available choices at each stage are finite and also what you mean by work. If there are only finitely many choices available in each state and there is a solution, i.e. a computational path which terminates successfully then this method will find it. In fact the method can be modified to cope with an infinite variety of choices, as long as it’s not too infinite. What the method can’t be relied on for is to tell us when there is no solution. It could tell us, if all paths have reached a dead end. It’s certainly possible though that there is no solution but there’s always something else to try so the algorithm will just run forever.

For the parsing problem you should not do this. There are algorithms which are much faster and which are guaranteed to tell you when the problem has no solution, i.e. when the list of tokens which is your input does not belong to the language. You should use one of those instead. They aren’t covered in this module though. Still, the idea of nondeterministic computation is one which we will meet again in this module. It’s not always this useless.