We need a bit of terminology in order to talk about this and other formal languages.
Languages have an alphabet consisting of tokens. These might be single characters or might be strings. Converting a list of characters into a list of tokens is the job of the lexical analyser. The tokens belong to sets, called terminal symbols. There are also nonterminal symbols, which we’ll get to later.
In the example above the alphabet consists of 17 tokens, each a single character, the decimal point and 16 digits, and there’s a terminal symbol for each of these.
In this case the lexical analyser doesn’t really have anything to do,
but we could have divided the work between the lexical analyser and the
parser differently. For example, we could have put all the digits into a
single token, DIGIT
, in which case the grammar would have
looked like
number : integer | "." integer | integer "." | integer "." integer
integer : DIGIT | integer DIGIT
The standard notational convention is that names of terminal symbols are written in upper case. The parser now doesn’t need to concern itself with which characters are digits as that’s handled by the lexical analyser.
We could shift even more work from the parser to the lexical
analyser, by deciding that uninterrupted strings of digits are tokens,
with the symbol INTEGER
. The parsers grammar is then just a
single line.
number : INTEGER | "." INTEGER | INTEGER "." | INTEGER "." INTEGER
In this case, unlike the previous ones, there are infinitely many tokens. This is allowed, but we’ll still insist on having only finitely many symbols. When there are only finitely many tokens it’s clear that we can figure out which symbol a token belongs to just by comparing it to the finite lists for each symbol. When we have finitely many symbols but infinitely many tokens there must be at least one symbol with finitely many tokens, so we can’t just examine the lists. There needs to be an algorithm for recognising tokens. We’ll discuss what kinds of algorithms are allowed later in the chapter on regular languages.
The symbols we’ve just discussed, the ones which are sets of tokens
are called terminal symbols or just terminals. There
are other symbols, conveniently called nonterminal symbols or
just nonterminals. In the original grammar above
number
, integer
, and digit
were
nonterminals. In fact the nonterminals are always precisely the things
you see listed on the left hand sides of all the grammar rules.
One of these symbols has a special status. It is called the start
symbol. In the example above number
is the the start
symbol. The notational convention I’m using is that the start symbol is
always the one on the left hand side of the first rule in the list.
A context free grammar is a finite set of grammar
rules, also sometimes called production rules. Each of
these grammar rules describes possible ways to build up a nonterminal
symbol from other symbols, which might be terminal or nonterminal. For
example, an integer
is either just a digit
or
is an integer
followed by a digit
. The
|
character separates the distinct possibilities in each
case.
You can see from the rule for integer
that when I wrote
that “each of these grammar rules describes possible ways to build up a
nonterminal symbol from other symbols” I didn’t mean the word “other” to
exclude the possibility of the same symbol occurring on both the left
hand side and the right hand side of a rule. In other words, rules can
be recursive. Indeed almost all interesting languages have recursive
rules.