Terminology

We need a bit of terminology in order to talk about this and other formal languages.

Languages have an alphabet consisting of tokens. These might be single characters or might be strings. Converting a list of characters into a list of tokens is the job of the lexical analyser. The tokens belong to sets, called terminal symbols. There are also nonterminal symbols, which we’ll get to later.

In the example above the alphabet consists of 17 tokens, each a single character, the decimal point and 16 digits, and there’s a terminal symbol for each of these.

In this case the lexical analyser doesn’t really have anything to do, but we could have divided the work between the lexical analyser and the parser differently. For example, we could have put all the digits into a single token, DIGIT, in which case the grammar would have looked like

number : integer | "." integer | integer "." | integer "." integer
integer : DIGIT | integer DIGIT

The standard notational convention is that names of terminal symbols are written in upper case. The parser now doesn’t need to concern itself with which characters are digits as that’s handled by the lexical analyser.

We could shift even more work from the parser to the lexical analyser, by deciding that uninterrupted strings of digits are tokens, with the symbol INTEGER. The parsers grammar is then just a single line.

number : INTEGER | "." INTEGER | INTEGER "." | INTEGER "." INTEGER

In this case, unlike the previous ones, there are infinitely many tokens. This is allowed, but we’ll still insist on having only finitely many symbols. When there are only finitely many tokens it’s clear that we can figure out which symbol a token belongs to just by comparing it to the finite lists for each symbol. When we have finitely many symbols but infinitely many tokens there must be at least one symbol with finitely many tokens, so we can’t just examine the lists. There needs to be an algorithm for recognising tokens. We’ll discuss what kinds of algorithms are allowed later in the chapter on regular languages.

The symbols we’ve just discussed, the ones which are sets of tokens are called terminal symbols or just terminals. There are other symbols, conveniently called nonterminal symbols or just nonterminals. In the original grammar above number, integer, and digit were nonterminals. In fact the nonterminals are always precisely the things you see listed on the left hand sides of all the grammar rules.

One of these symbols has a special status. It is called the start symbol. In the example above number is the the start symbol. The notational convention I’m using is that the start symbol is always the one on the left hand side of the first rule in the list.

A context free grammar is a finite set of grammar rules, also sometimes called production rules. Each of these grammar rules describes possible ways to build up a nonterminal symbol from other symbols, which might be terminal or nonterminal. For example, an integer is either just a digit or is an integer followed by a digit. The | character separates the distinct possibilities in each case.

You can see from the rule for integer that when I wrote that “each of these grammar rules describes possible ways to build up a nonterminal symbol from other symbols” I didn’t mean the word “other” to exclude the possibility of the same symbol occurring on both the left hand side and the right hand side of a rule. In other words, rules can be recursive. Indeed almost all interesting languages have recursive rules.