Encoding grammar

Grammars are defined mostly in terms of concatenation. Consider, for example, our grammar for balanced parentheses. A string belongs to the grammar of balanced parentheses if and only if we can parse it using the rule

ok  : | "(" ok ")" ok

The string, (()(())), for example, can be parsed in the following steps.

ok
( ok ) ok
( ( ok ) ok ) ok
( ( ) ok ) ok
( ( ) ( ok ) ok ) ok
( ( ) ( ( ok ) ok ) ok ) ok
( ( ) ( ( ) ok ) ok ) ok
( ( ) ( ( ) ) ok ) ok
( ( ) ( ( ) ) ) ok
( ( ) ( ( ) ) )

This is a sequence of lists of symbols in the grammar, i.e. an element of our extended grammar, except that I’ve used newlines instead of commas and I’ve inserted some spaces to make things look nicer. This sequence has four crucial properties.

The first list has only one element, the start symbol.
The last list is the one we were trying to parse.
The last list consists only only of terminal symbols.
Each list is obtained by taking a previous list and replacing an ok by one of its two allowed expansions.

Each of these properties can be expressed in terms of concatenation. Our first task is to separate the one large list of symbols that we have into smaller lists, separated by our separator, which I’m going to refer to as the comma, even though I used newlines above. Note that being a sublist of another list is a property we can describe in terms of concatenation. β is a sublist of δ if we can write δ as the concatenation of α, β and γ. A list contains a comma if and only if and only if it has a sublist with a single element, a comma. The pieces we want to split our list into are the largest comma-less pieces, i.e. those which have no comma but are not sublists of any other comma-less list. Having split the list in this way it’s easy to express the first and last condition above. The second one can be handled similarly to the problem of identifying comma-less lists before. We just need to do that, restricted to the final part of the list, with each non-terminal in place of a comma. The third one isn’t much harder. It’s straightforward to express in terms of concatenation the fact that one of these maximal lists appears before another in the sequence. β precedes δ if the whole list can be written as the concatenation of α, β, γ, δ and ε. We can also express the fact that δ is obtained from β by expanding an ok to an empty list. That means that there are κ, λ, μ, and ν such that β is κλν, δ is κμν, λ consists of a single ok, and μ is empty. Expressing the fact that δ is obtained from β by expanding an ok to a list of the form ( ok ) ok is similar, but we use that list of four symbols for μ. We’ve now expressed all the conditions above in terms of quantifiers, Boolean operators, and concatenation.

We can now use our earlier observation that concatenation is represented by a constructively arithmetic operation to express parsing in arithmetic terms. Specifically we can express the fact that a given sequence is a valid parsing for a given list of terminal symbols as an arithmetic relation between their encodings. In fact this relation is constructively arithmetic. To see this it suffices to observe that all the lists we need to consider are sublists of the given parsing list and that sublists have encodings which are no larger than the encoding of the full list. The property of belonging to the list of balanced parentheses is also an arithmetic property, since that just means there exists some natural number which encodes a valid parsing, so we just need to stick another quantifier on to whatever expression we’ve constructed to express the relation of being a valid parsing for a list. There’s one complication though. We don’t have any bound for the encoding of the parsing in terms of the encoding of the original list, so this won’t be a bounded quantifier and we can only conclude that the encodings of members of the language form an arithmetic set. We don’t gain any information about from this about whether it’s constructively arithmetic, but that’s less important.