A parser for the prefix language

It’s much easier to write a parser for a prefix or postfix language than an infix one. In fact here’s a simple parser for the prefix version of our language, without the unnecessary parentheses.

The boring bit of the parser is the lexical analyser, the bit which separates the input stream into the three logical connectives “and”, “or” and “not” and the module names. We’ll call these tokens. With my rather drastic requirement that module names are single words starting with capital letters this part is easy, but for many interesting languages it is more difficult. I will talk later about how a lexical analyser splits input into tokens but for now we’ll just assume we have a lexical analyser and that our input is split into tokens, which the parser reads in one at a time.

Slightly simpler than an actual parser is a grammar checker. The only data structure this needs is a single integer, which we’ll call the counter. The is initialised to 1. When the grammar checker reads an “and” or an “or” it increments the counter. When it reads a module name it decrements the counter. When it reads a “not” it does nothing. If the value of the counter reaches 0 at the end, but not before, then the input is grammatically correct. Otherwise it isn’t. Here’s the input “or and Probability Statistics and Algebra Geometry” together with the value of the counter at each point in the input:

1 or 2 and 3 Probability 2 Statistics 1 and 2 Algebra 1 Geometry 0

We can turn this into a abstract syntax tree by scanning through for sequences of tokens where the value of the counter remains at least as high as its value at the start of the sequence until the end of the sequence, where it’s 1 lower. The seven sequences with this property in our example are

1 or 2 and 3 Probability 2 Statistics 1 and 2 Algebra 1 Geometry 0
2 and 3 Probability 2 Statistics 1
1 and 2 Algebra 1 Geometry 0
3 Probability 2
2 Statistics 1
2 Algebra 1
1 Geometry 0

For each of them we have a node in our tree, which we will label with the first token of the sequence. Whenever one sequence contains in another we’ll draw an arrow from the first to the second, unless there’s an intermediate sequence, i.e. one which contains the second and is contained in the first. Depending on our conventions we can label the node with the whole phrase or just the first token. The version where we give just the first token is one of the abstract syntax tree diagrams we saw earlier. The version with the whole phrase is often called the “parse tree”, although some authors use those terms interchangeably.