Parse trees

The process described above, splitting a statement up into successively smaller phrases until we get to the simplest possible components, is called parsing. A common way to describe the result, both for natural and for formal languages, is with tree diagrams. There are two slightly different ways to do this, which are perhaps best illustrated by examples.

The first three accompanying figures give abstract syntax trees for three different parsings of “Probability ∧ Statistics ∨ Algebra ∧ Geometry”. These are followed by three parse trees for the same three parsings.

Syntax tree for ((Probability ∧ Statistics) ∨ (Algebra ∧ Geometry))

Parse tree for ((Probability ∧ Statistics) ∨ (Algebra ∧ Geometry))

A tree has elements called nodes and has arrows from one node to another. The nodes with no arrows going out are called the leaves of the tree. There is a single node with no arrows coming in, which is called the root. In the case of the syntax trees the leaves are all labelled by module names and the other nodes are all labelled by Boolean operators. In the case of the parse trees the leaves are labelled either by module names or by Boolean operators and the other nodes are unlabeled.

These visual representations are nice, but all computers, and many humans, are blind. It’s possible to describe the same information in a different way, with fully parenthesised expressions. The fully parenthesised expressions corresponding trees above are

((Probability ∧ Statistics) ∨ (Algebra ∧ Geometry))

((Probability ∧ (Statistics ∨ Algebra)) ∧ Geometry)

(Probability ∧ ((Statistics ∨ Algebra) ∧ Geometry))

Parse trees correspond exactly to fully parenthesised expressions. Each pair of balanced parentheses corresponds to a non-leaf node in the diagram. The internal representation a computer would use for a tree data structure isn’t any of these. The visual description and the parenthesised expressions are just for humans.

The fact that there are two possible abstract syntax trees for “Probability ∧ (Statistics ∨ Algebra) ∧ Geometry”, depending on which ∧ has higher precedence, shows that our grammar isn’t fully unambiguous, even after specify the precedence of operators.

When parsing statements in a formal language with a program one often wants to construct a data structure which mirrors this structure. For simple enough languages though it may be possible, as we’ll see, to use simpler data structures than a tree.