Back to the beginning

In the introduction I informally introduced a language for module selection rules. I can now provide an actual grammatical description. In fact I can provide more than one. The typical way to do things in practice would be with two stages, a lexical analyser and a parser. If the lexical analyser is doing the work of breaking the input into tokens then our grammar, in the same notation as that of the POSIX standard can be as simple as

%token MODULE
/* any possible module name */

%token And Or Not
/* 'and', 'or', 'not' */

%start statement

%%

statement  : statement2
           | statement Or statement2
           ;

statement2 : statement3
           | statement2 And statement3
           ;

statement3 : statement4
           | Not statement4
           ;

statement4 : MODULE
           | ( statement )
           ;

In fact even simpler grammars are possible but this one is unambiguous and always generates the correct parse tree. The different levels of statements ensure this. For example, “not Algebra and Geometry” will be parsed as if it were “((not Algebra) and Geometry” rather than as “(not (Algebra and Geometry))” because “not” can only appear before a level 4 statement. “Algebra”, as a module name, is a level 4 statement but “Algebra and Geometry is a level 2 statement.

This language is simple enough that we could dispense with the separate lexical analysis step entirely though and work directly with characters rather than strings as tokens. A grammar which does this is

%start statement

%%

statement  : statement2
           | statement spaces Or spaces statement2
           ;

statement2 : statement3
           | statement2 spaces And spaces statement3
           ;

statement3 : statement4
           | Not spaces statement4
           ;

statement4 : module
           | '(' statement ')'
           | '(' spaces statement ')'
           | '(' statement spaces ')'
           | '(' spaces statement spaces ')'
           ;

spaces     : ' '
           | spaces ' '
           ;

module     : capital
           | capital letters
           ;

letters    : letter
           | letters letter
           ;

capital    : 'A' | 'B' | 'C' | 'D' | 'E' | 'F' | 'G' | 'H' | 'I'
           | 'J' | 'K' | 'L' | 'M' | 'N' | 'O' | 'P' | 'Q' | 'R'
           | 'S' | 'T' | 'U' | 'V' | 'W' | 'X' | 'Y' | 'Z'
           ;

letter     : 'a' | 'b' | 'c' | 'd' | 'e' | 'f' | 'g' | 'h' | 'i'
           | 'j' | 'k' | 'l' | 'm' | 'n' | 'o' | 'p' | 'q' | 'r'
           | 's' | 't' | 'u' | 'v' | 'w' | 'x' | 'y' | 'z'

And        : 'a' 'n' 'd'
           ;

Or         : 'o' 'r'
           ;

Not        : 'n' 'o' 't'
           ;

Lexical analysers use spaces or other whitespace to separate tokens but don’t typically pass this whitespace on to the parser so if we’re dispensing with the parser then we need to handle this in the parser.