Encoding

Suppose we want to encode lists of symbols as natural numbers. We don’t necessarily need the set of symbols to be finite but it makes things easier and that special case is sufficient for our present purposes. For applications to languages I’m going to assume that there is only one token per terminal symbol. If the set of tokens is finite then we can always arrange this, but also it happens to be true already for most of the languages we’ll consider, including the language of elementary arithmetic considered in this chapter.

One tempting idea, if we have \({ b }\) tokens, is to use a base \({ b }\) encoding, where we assign one digit to each symbol and just convert a list of symbols to the corresponding list of digits, and then convert that to the natural number whose base \({ b }\) representation is that list of digits. There’s a subtle problem with this idea though, which is that some symbol will be assigned the digit 0 and two lists which differ only in the number of occurrences of that symbol at the start of the list will give lists of digits differ only in the number of leading 0’s and will therefore be encoded as the same natural number. This is therefore a lossy encoding.

There’s an easy fix to the problem above. We can use a base \({ b + 1 }\) encoding, and just not use the digit 0 for any of the symbols. This isn’t a terrible idea but it does have one disadvantage. Unlike the previous encoding, not every natural number will correspond to to a string of symbols. Only those natural numbers whose base \({ b + 1 }\) encoding have no 0 digits will encode lists of symbols. That’s better than having some natural number represent multiple lists, but it’s still somewhat inconvenient.

Following an idea of Raymond Smullyan we can modify the base \({ b }\) encoding as follows. Suppose that, just as with ordinary base \({ b }\), we associate the list of digits \[ ( d _ 0 , d _ 1 , \ldots , d _ { l - 2 } , d _ { l - 1 } ) \] with the natural number \[ d _ 0 · b ^ { l - 1 } + d _ 1 · b ^ { l - 2 } + \cdots + d _ { l - 2 } · b + d _ { l - 1 } \] but instead of choosing the digits from the set \({ \{ 0 , 1 , \ldots , b - 2 , b - 1 }\) we choose them from \({ \{ 1 , 2 , \ldots , b - 1 , b \} }\). Like the base \({ b + 1 }\) encoding above this encoding is lossless, since it’s fairly easy to show that no two lists give the same natural number. It still has the minor problem that not every natural number is the encoding of a list, but this time the only number which is left out is 0, a problem which we can easily fix by subtracting 1 from the expression above. So our new encoding is to assign the digits \({ \{ 1 , 2 , \ldots , b - 1 , b \} }\) to our symbols, convert the list of symbols to a list of digits, and then form the natural number \[ d _ 0 · b ^ { l - 1 } + d _ 1 · b ^ { l - 2 } + \cdots + d _ { l - 2 } · b + d _ { l - 1 } - 1 . \] Now every list of symbols is represented by a natural number and every natural number represents a list of symbols.

Once we know how to encode lists of symbols as natural numbers we can encode sets of lists of symbols as sets of natural numbers, relations on lists of symbols as relations on natural numbers, etc.

As an example, consider the language of balanced parentheses. Our original grammar was

ok : | "(" ok ")" ok

For reasons which will become clear soon I want to allow not just strings with balanced parentheses by finite sequences of such strings, separated by commas, so our new language is

seq : ok | seq "," ok
ok  : | "(" ok ")" ok

Strings with balanced parentheses can then be thought of as sequences with only one element. This new grammar has five symbols, the three terminals (, ) and , and the two non-terminals seq and ok. We can assign those five symbols to the five base five digits 1, 2, 3, 4 and 5, in that order. Then the string (()(())), for example, would be encoded as \[ 1 · 5 ^ 7 + 1 · 5 ^ 6 + 2 · 5 ^ 5 + 1 · 5 ^ 4 + 1 · 5 ^ 3 + 2 · 5 ^ 2 + 2 · 5 + 2 - 1 \] This is the decimal number 100811, but there really isn’t much reason to convert these encodings to and from decimal.

The list of symbols in the example above happened to be a member of the language, in fact a member of both the original language and the extended one, but this encoding scheme allows us to represent any list of symbols as a natural number. This raises an obvious question: can we identify which natural numbers correspond to members of the language? There are actually two languages here, the original language of strings with balanced parentheses and the extended language of sequences of such strings. Although the answer turns out to be the same for both the question we’re really interested in is the one for the original language.

The encoding system described above has the very useful property that the concatenation relation is arithmetic, and in fact constructively arithmetic. If you’re not familiar with concatenation, the concatenation of the list \[ ( c _ 0 , c _ 1 , \ldots , c _ { k - 2 } , c _ { k - 1 } ) \] and the list \[ ( d _ 0 , d _ 1 , \ldots , d _ { l - 2 } , d _ { l - 1 } ) \] is the list \[ ( c _ 0 , \ldots , c _ { k - 1 } , d _ 0 , \ldots , d _ { l - 1 } ) . \] Suppose \({ x }\) is the encoding of the first list above, \({ y }\) is the encoding of the second list, and \({ z }\) is the encoding of their concatenation, i.e. third list. In other words, suppose that \[ x ' = c _ 0 · b ^ { k - 1 } + c _ 1 · b ^ { k - 2 } + \cdots + c _ { k - 2 } · b + c _ { k - 1 } , \] \[ y ' = d _ 0 · b ^ { l - 1 } + d _ 1 · b ^ { l - 2 } + \cdots + d _ { l - 2 } · b + d _ { l - 1 } , \] and \[ z ' = c _ 0 · b ^ { k + l - 1 } + \cdots + c _ { k - 1 } · b ^ l + d _ 0 · b ^ { l - 1 } + \cdots d _ { l - 1 } . \] Then \[ z ' = b ^ l · x ' + y ' . \]

Aside from the parentheses, which we can easily add, the thing which prevents this from being an expression in our language is the power \({ b ^ l }\). At this point it is helpful to make the additional assumption that \({ b }\) is prime. This is avoidable but avoiding it requires considerable effort and in all the applications we have in mind we have the option of adding more symbols and simply not using them so there is no real loss of generality. Of course what we’re using here is the fact that for every natural number there is a larger number which is prime, i.e. that there are infinitely many primes, mentioned earlier. In this case \({ z }\) is a power of \({ b }\) if and only if every divisor of \({ z }\) is either \({ 1 }\) or \({ b }\). Suppose \({ z }\) is positive. Then we only need to test divisors between \({ 1 }\) and \({ z }\), since there are no others. \({ x }\) is a divisor if and only if there is a \({ y }\) such that \({ x · y = z }\). Again, if \({ z }\) is positive then we only need to test values of \({ y }\) between \({ 1 }\) and \({ z }\). So \({ x = v ' }\) and \({ y = w ' }\) for some natural numbers \({ v }\) and \({ w }\) less than \({ z }\). The condition that \({ x = 1 }\) is equivalent to \({ v = 0 }\). The condition that \({ x }\) is a multiple of \({ b }\) is that there is a \({ y }\) such that \({ b · y = x }\). Again, we only need to consider values of \({ y }\) between \({ 1 }\) and \({ x }\), which are therefore between \({ 1 }\) and \({ z }\). Again, since \({ y }\) is positive it must be of the form \({ y = w ' }\) for some \({ w }\). In this way we are led to the bounded expression \[ \begin{split} Q ( z ) & ≡ \{ ( z > 0 ) ∧ [ ∀ v < z : ( \{ ∃ w < z : [ ( v ' · w ' ) = z ] \} \\ & \qquad {} ⊃ [ ( v = 0 ) ∨ \{ ∃ w < z : [ ( b · w ' ) = v ' ] \} ] ) ] \} . \end{split} \] If you compare this to the earlier expression for prime powers you’ll see that in addition to replacing \({ p }\) with \({ b }\) I’ve added the explicit condition \({ z > 0 }\) and I’ve used \({ v }\) and \({ w }\) instead of \({ x }\) and \({ y }\). The second change isn’t a simple substitution, since the relations between these are \({ x = v ' }\) and \({ y = w ' }\). Most importantly, I’ve bounded all the quantifiers. The trick of using \({ v }\) and \({ w }\) instead of \({ x }\) and \({ y }\) was used to get variables which are strictly less than \({ z }\), rather than merely less than or equal to it, as required for bounded quantifiers. This also required ruling out the case \({ z = 0 }\) explicitly, because there is no upper bound on the divisors of \({ 0 }\).

\({ y ' }\) is a number whose modified base \({ b }\) representation has length \({ l }\). The smallest such number is the one where all the digits are \({ 1 }\), i.e. \[ 1 · b ^ { l - 1 } + 1 · b ^ { l - 2 } + \cdots + 1 · b + 1 = \frac { b ^ l - 1 } { b - 1 } \] The largest such number is the one where all the digits are \({ b }\), i.e. \[ b · b ^ { l - 1 } + b · b ^ { l - 2 } + \cdots + b · b + b = b \frac { b ^ l - 1 } { b - 1 } \] In other words, \[ \frac { b ^ l - 1 } { b - 1 } \le y ' \] and \[ y ' \le b \frac { b ^ l - 1 } { b - 1 } \] This is equivalent to \[ b ^ l + y \le b y ' \] and \[ b y + b + b \le b · b ^ l + y ' . \] In other words, \({ w = b ^ l }\) if and only if \({ w }\) is a power of \({ b }\) and satisfies the inequalities \[ w + y \le b y ' \] and \[ b y ' + b \le b · w + y ' . \] So \({ b ^ l }\) is the unique \({ w }\) which makes the expression \[ S ( w , x , y , z ) ≡ [ Q ( w ) ∧ ( [ ( w + y ) \le ( b · y ' ) ] ∧ \{ [ ( b · y ) + b ] \le [ ( b · w ) + y ' ] \} ) ] \] true. Now we can express the condition for \({ z }\) to be encoding of the concatenation of the lists encoded by \({ x }\) and \({ y }\). \[ C ( x , y , z ) ≡ \{ ∃ w < y : [ S ( w , x , y , z ) ∧ \{ [ ( w · x ' ) + y ' ] = z ' \} ] \} . \]