Back to the beginning

In the introduction I informally introduced a language for module selection rules. I can now provide an actual grammatical description. In fact I can provide more than one. The typical way to do things in practice would be with two stages, a lexical analyser and a parser. If the lexical analyser is doing the work of breaking the input into tokens then our grammar, in the same notation as before, is

statement  : statement2
           | statement "∨" statement2
statement2 : statement3
           | statement2 "∧" statement3
statement3 : statement4
           | "¬" statement4
statement4 : MODULE
           | "(" statement ")"

The non-terminal symbols are the “∨”, “∧”, “¬”, “(” and “)”, each with that string as the only token belonging to the symbol, and MODULE, which is a symbol containing all possible module names. There are four different nonterminals, the four types of statements, and statement is the start symbol.

Separating out the input into tokens and assigning them to terminal symbols is, as usual, the job of the lexical analyser. In practice we tend to give the syntactic analyser another job as well, removing unnecessary white space, i.e. space characters, tabs, and new lines. This allows us to write more readable input without burdening the parser.

In fact even simpler grammars are possible but this one is unambiguous and always generates the correct parse tree. The different levels of statements ensure this. For example, “¬ Algebra ∧ Geometry” will be parsed as if it were “((¬ Algebra) ∧ Geometry” rather than as “(¬ (Algebra ∧ Geometry))” because “¬” can only appear before a level 4 statement. “Algebra”, as a module name, is a level 4 statement but “Algebra ∧ Geometry is a level 2 statement.

This language is simple enough that we could dispense with the separate lexical analysis step entirely though and work directly with characters rather than strings as tokens. A grammar which does this is

statement  : statement2 | statement spaces "∨" spaces statement2
statement2 : statement3 | statement2 spaces "∧" spaces statement3
statement3 : statement4 | "¬" spaces statement4
statement4 : module | "(" statement ")" | "(" spaces statement ")"
           | "(" statement spaces ")" | "(" spaces statement spaces ")"
ows        : | spaces
spaces     : " " | spaces " "
module     : words
words      : word | words spaces word
word       : letter | letters letter
letter     : "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I"
           | "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R"
           | "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z" | "0"
           | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
           | "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i"
           | "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r"
           | "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"

ows is an abbreviation for “optional white space”. For simplicity I’ve assumed that this consists solely of spaces, not tabs or line breaks, and the module names don’t use any characters other than English letters or numbers.