A parser for the prefix language

It’s much easier to write a parser for a prefix or postfix language than an infix one. In fact here’s a simple parser for the prefix version of our language, without the unnecessary parentheses.

The boring bit of the parser is the lexical analyser, the bit which separates the input stream into the three Boolean operators ∧, ∨ and ¬ and the module names. We’ll call these tokens. This is easy for our language, since we’ll interpret any any string of characters other than ∧, ∨ and ¬ as a module name. Checking whether the string corresponds to some actually existing module is not the lexical analyser’s problem, although we could have made it part of the lexical analyser’s job. How much work, if any, the lexical analyser should do is a design decision. I will talk later about how a lexical analyser splits input into tokens but for now we’ll just assume we have a lexical analyser and that our input is split into tokens, which the parser reads in one at a time.

Slightly simpler than an actual parser is a grammar checker. The only data structure this needs is a single integer, which we’ll call the counter. The is initialised to 1. When the grammar checker reads an ∧ or an ∨ it increments the counter. When it reads a module name it decrements the counter. When it reads a ¬ it does nothing. If the value of the counter reaches 0 at the end of the input, but not before, then the input is grammatically correct. Otherwise it isn’t. Here’s the input “∨ ∧ Probability Statistics ∧ Algebra Geometry” together with the value of the counter at each point in the input:

1 ∨ 2 ∧ 3 Probability 2 Statistics 1 ∧ 2 Algebra 1 Geometry 0

In the theory of formal languages what I’ve just called a grammar checker is called a recogniser.

We can turn this into an abstract syntax tree by scanning through for sequences of tokens where the value of the counter remains at least as high as its value at the start of the sequence until the end of the sequence, where it’s 1 lower. The seven sequences with this property in our example are

1 ∨ 2 ∧ 3 Probability 2 Statistics 1 ∧ 2 Algebra 1 Geometry 0
2 ∧ 3 Probability 2 Statistics 1
1 ∧ 2 Algebra 1 Geometry 0
3 Probability 2
2 Statistics 1
2 Algebra 1
1 Geometry 0

For each of them we have a node in our tree, which we will label with the first token of the sequence. Whenever one sequence contains in another we’ll draw an arrow from the first to the second, unless there’s an intermediate sequence, i.e. one which contains the second and is contained in the first. The result is the tree in the accompanying figure, which we saw earlier for the infix version of the same statement.

Syntax tree for ∨ ∧ Probability Statistics ∧ Algebra Geometry

The parse tree is similar to the one we saw earlier, but the order of the children is different, as shown in the accompanying diagram.

Parse tree for ∨ ∧ Probability Statistics ∧ Algebra Geometry