Terminology

We need a bit of terminology in order to talk about this and other formal languages.

Languages have an alphabet consisting of tokens. These might be single characters or might be strings. If they’re not single characters then a lexical analyser is needed to group characters into tokens. The grammar above starts with some information about the tokens in the the language bc accepts. = and += are both tokens, for example. Each token is assigned to one, and only one, group, called its symbol. In the example = and += are in the group ASSIGN_OP. Assigning tokens to symbols part of the lexical analyser’s job. While we may allow infinitely many tokens there should only be finitely many symbols. In the bc grammar STRING is a symbol with infinitely many tokens. Although it’s not specified in the formal part of the grammar the lexical analyser recognises almost any string of characters as a STRING. Some tokens are likely to be in a group by themselves, in which case people tend to blur the token/symbol distinction. Technically, for example, there are two NEWLINEs, the token and the symbol, a set of tokens whose only element is the NEWLINE token. A lot of people blur the distinction even when there are multiple symbols in a group and use the words symbol and token as interchangeable.

The symbols we’ve just discussed, the ones which are groups of tokens are called terminal symbols or just terminals. There are other symbols, conveniently called nonterminal symbols or just nonterminals. In the bc example program, input_item and semicolon_list are nonterminals. In fact the non-terminals precisely the things you see listed on the left hand sides of all the grammar rules after the line

%%

One of these symbols has a special status. It is called the start symbol. In the example above program is the the start symbol. You can tell because of the line

%start    program

Not everyone labels the start symbol. If it’s not labelled then the convention is that it’s the one on the left hand side of the first grammar rule. A context free grammar is a finite set of grammar rules, also sometimes called production rules. Each of these grammar rules describes possible ways to build up a nonterminal symbol from other symbols, which might be terminal or nonterminal. For example a program can be an EOF symbol, which is just the end of file marker, or an input_item followed by a program. The EOF symbol is terminal while input_item and program are nonterminal. An input_item can be a semicolon_list followed by a NEWLINE or a function. NEWLINE is a terminal while semicolon_list and function are non-terminal. The | character separates the distinct possibilities in each case.

You can see from the rule for program that when I wrote that “each of these grammar rules describes possible ways to build up a nonterminal symbol from other symbols” I didn’t mean the word “other” to exclude the possibility of the same symbol occurring on both the left hand side and the right hand side of a rule. In other words, rules can be recursive.

As might be expected from a field where people from radically different fields, like computer science, linguistics and mathematics, not everyone uses the same terminology and the specifics of how grammar rules are written can vary a lot. Most differences are minor but one is quite significant: whether the “other symbols” referred to above, used to build up a nonterminal symbol, could include no symbols. The three conventions in common use are to allow this in all cases, to allow it only for the start symbol, and do allow it for no symbols. The authors of the specification belong to the tradition in which this is allowed for all symbols, as you can see from the rules for semicolon_list, statement_list, opt_parameter_list and a number of others. The /* empty */ things you see there, like everything between a /* and a following */, are comments, which are there to draw attention to the fact that each of these symbols could be built from no symbols, and the meaning would be the same if they were omitted.