We need a bit of terminology in order to talk about this and other formal languages.
Languages have an alphabet consisting of tokens.
These might be single characters or might be strings. If they’re not
single characters then a lexical analyser is needed to group characters
into tokens. The grammar above starts with some information about the
tokens in the the language bc
accepts. =
and
+=
are both tokens, for example. Each token is assigned to
one, and only one, group, called its symbol. In the example
=
and +=
are in the group
ASSIGN_OP
. Assigning tokens to symbols part of the lexical
analyser’s job. While we may allow infinitely many tokens there should
only be finitely many symbols. In the bc
grammar
STRING
is a symbol with infinitely many tokens. Although
it’s not specified in the formal part of the grammar the lexical
analyser recognises almost any string of characters as a
STRING
. Some tokens are likely to be in a group by
themselves, in which case people tend to blur the token/symbol
distinction. Technically, for example, there are two
NEWLINE
s, the token and the symbol, a set of tokens whose
only element is the NEWLINE
token. A lot of people blur the
distinction even when there are multiple symbols in a group and use the
words symbol and token as interchangeable.
The symbols we’ve just discussed, the ones which are groups of tokens
are called terminal symbols or just terminals. There
are other symbols, conveniently called nonterminal symbols or
just nonterminals. In the bc
example
program
, input_item
and
semicolon_list
are nonterminals. In fact the non-terminals
precisely the things you see listed on the left hand sides of all the
grammar rules after the line
%%
One of these symbols has a special status. It is called the start
symbol. In the example above program
is the the start
symbol. You can tell because of the line
%start program
Not everyone labels the start symbol. If it’s not labelled then the
convention is that it’s the one on the left hand side of the first
grammar rule. A context free grammar is a finite set of
grammar rules, also sometimes called production rules.
Each of these grammar rules describes possible ways to build up a
nonterminal symbol from other symbols, which might be terminal or
nonterminal. For example a program
can be an
EOF
symbol, which is just the end of file marker, or an
input_item
followed by a program
. The
EOF
symbol is terminal while input_item
and
program
are nonterminal. An input_item
can be
a semicolon_list
followed by a NEWLINE
or a
function
. NEWLINE
is a terminal while
semicolon_list
and function
are non-terminal.
The |
character separates the distinct possibilities in
each case.
You can see from the rule for program
that when I wrote
that “each of these grammar rules describes possible ways to build up a
nonterminal symbol from other symbols” I didn’t mean the word “other” to
exclude the possibility of the same symbol occurring on both the left
hand side and the right hand side of a rule. In other words, rules can
be recursive.
As might be expected from a field where people from radically
different fields, like computer science, linguistics and mathematics,
not everyone uses the same terminology and the specifics of how grammar
rules are written can vary a lot. Most differences are minor but one is
quite significant: whether the “other symbols” referred to above, used
to build up a nonterminal symbol, could include no symbols. The three
conventions in common use are to allow this in all cases, to allow it
only for the start symbol, and do allow it for no symbols. The authors
of the specification belong to the tradition in which this is allowed
for all symbols, as you can see from the rules for
semicolon_list
, statement_list
,
opt_parameter_list
and a number of others. The
/* empty */
things you see there, like everything between a
/*
and a following */
, are comments, which are
there to draw attention to the fact that each of these symbols could be
built from no symbols, and the meaning would be the same if they were
omitted.