Encoding

We can encode any formal language into natural numbers. There are in fact a number of ways to do this. The simplest, which works when the number of tokens is finite, is to let \({ b }\) be the number of tokens, assign each token to one of the “digits” \({ 1 }\), \({ 2 }\), …, \({ b }\), and use the base \({ b + 1 }\) representation of natural numbers. In more detail, given an element of the given language, i.e. a list of tokens which belongs to the language, we replace each token by the corresponding digit and view the result as a natural number represented in base \({ b + 1 }\). In the other direction, given a natural number we can form its base \({ b + 1 }\) representation, which is a list of digits, and, if the digit 0 does not appear, we can replace those digits by the corresponding tokens, obtaining a list of tokens. This list may or may not be an element of the language. This is, of course, the reverse of the process by which we generated natural numbers from lists of tokens. In this way we can identify the language with a subset of the natural numbers.

You may wonder why I used base \({ b + 1 }\) and avoided using the 0 digit in the encoding above. The reason is technical. If we allowed 0 and the corresponding token is one which could occur as the initial token in an element of the language then the natural number representation of that string would start with a 0. Removing that 0 would give me the same natural number, but corresponding to a shorter list of tokens, which might or might not be an element of the language but certainly isn’t the same element of the language. Our encoding therefore would be lossy; it would be impossible to recover lists of tokens uniquely from natural numbers. Since we can’t use 0 as a digit we have to use base \({ b + 1 }\) rather than base \({ b }\).

As an example of the encoding above, consider the element (()(())) in the language of balanced parentheses. This language has two tokens so we we use base 3. Associating ( to 1 and ) to 2 we replace (()(())) by 11211222, which is the base 3 representation of the number whose decimal representation is 3536. So this number is the encoding of (()(())).

The method above can be modified to cope with languages with infinitely many tokens, provided the set of tokens is countable, a term which will be defined in the next chapter. All the languages we’ll encode will have only finitely many tokens though.

There are other ways to encode languages as natural numbers. The precise encoding chosen doesn’t affect the validity of anything I’ll write below, although some encodings make proofs easier and some make them harder. The particular encoding described above is simple but tends to make proofs hard. I generally won’t be providing proofs though so it’s not worth the effort of setting up an encoding which is better adapted providing proofs.

The most important thing to know about encodings of languages is that if the language is defined by a phrase structure grammar then the set of natural numbers corresponding to lists in the language is arithmetic. This is quite painful to prove, unfortunately.