Natural languages

Generative grammar was originally developed for natural languages rather than for formal languages. As it turns out, natural languages are rather messy and the model doesn’t fit them fully, but you have been exposed to them for longer, so it may be easier to understand some of the concepts in that context. The usual way to do this is to take words as tokens. What people in formal language theory call an “alphabet” is then not what we would normally think of as the alphabet of the language but is in fact its full vocabulary.

The lexical analyser splits a stream of characters into words. This is easy in a language like English, where we put spaces between words, but more difficult in a language like Japanese, which does not. Even for English, things are a bit more complex than just splitting at white space, but let’s not worry about that now. The set of English words is theoretically finite, but not in any particularly useful way. There are languages, e.g. Turkish, where the set of words is arguably infinite. Terminal symbols represent types of words, often called “parts of speech”, like nouns, verbs, adjectives, etc. There is a finite, and indeed rather small, set of such terminals. In formal language theory we generally assume that each token belongs to one and only one terminal. Some natural languages more or less satisfy this condition. English very much does not. The word “fast”, for example, can be a noun, verb, adjective or adverb. Having raised these issues I will now ignore them, and pretend we have a lexical analyser which can reliably split our input into tokens, i.e. words, and uniquely identify their terminals, i.e. parts of speech.
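Such a lexical analyser might be sketched as follows. This is only an illustration under the simplifying assumptions just made: the lexicon is a tiny hand-made table, and each word is assigned exactly one part of speech, which, as noted, English does not really allow.

```python
# A toy lexical analyser for a small fragment of English.
# Assumption (unrealistic for real English): each word has exactly
# one part of speech, so the lexicon is a simple dictionary.
LEXICON = {
    "masaru": "NOUN", "man": "NOUN", "apartment": "NOUN", "telescope": "NOUN",
    "saw": "VERB",
    "a": "DETERMINER", "the": "DETERMINER",
    "in": "PREPOSITION", "with": "PREPOSITION",
}

def tokenize(sentence):
    """Split at white space and look up each word's part of speech."""
    return [(word, LEXICON[word.lower()]) for word in sentence.split()]
```

Running `tokenize("Masaru saw a man")` produces the token stream with one terminal per token, which is exactly the input a parser expects.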

A full grammar for a language like English would be very complicated but it’s surprisingly easy to write down simplified grammars which are nonetheless sufficient to parse even fairly complex sentences. The following example is from Masaru Tomita:

sentence : noun_phrase verb_phrase | sentence prep_phrase
noun_phrase : NOUN | DETERMINER NOUN | noun_phrase prep_phrase
prep_phrase : PREPOSITION noun_phrase
verb_phrase : VERB noun_phrase

There are four terminals, NOUN, DETERMINER, PREPOSITION and VERB, and four nonterminals, sentence, noun_phrase, prep_phrase, and verb_phrase. This grammar is clearly recursive, since sentence is defined in terms of itself, as is noun_phrase. Grammars for natural languages are always recursive.
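One consequence of the recursion is that the grammar generates infinitely many sentences. The following sketch makes this concrete by expanding nonterminals at random; the depth bound, which avoids the recursive alternatives once expansion has gone deep enough, and the tiny word lists are my own additions for illustration.

```python
import random

# The grammar above, as a table of alternatives for each nonterminal.
GRAMMAR = {
    "sentence": [["noun_phrase", "verb_phrase"], ["sentence", "prep_phrase"]],
    "noun_phrase": [["NOUN"], ["DETERMINER", "NOUN"],
                    ["noun_phrase", "prep_phrase"]],
    "prep_phrase": [["PREPOSITION", "noun_phrase"]],
    "verb_phrase": [["VERB", "noun_phrase"]],
}
# A made-up mini-lexicon mapping each terminal to some words.
WORDS = {"NOUN": ["man", "telescope", "apartment"], "VERB": ["saw"],
         "DETERMINER": ["a", "the"], "PREPOSITION": ["in", "with"]}

def generate(symbol="sentence", depth=0):
    """Randomly expand symbol into a list of words."""
    if symbol in WORDS:
        return [random.choice(WORDS[symbol])]
    options = GRAMMAR[symbol]
    if depth > 4:
        # Past the depth bound, drop directly recursive alternatives
        # so the expansion terminates.
        options = [rhs for rhs in options if symbol not in rhs] or options
    words = []
    for s in random.choice(options):
        words.extend(generate(s, depth + 1))
    return words
```

Because noun_phrase can always be expanded to noun_phrase prep_phrase again, there is no longest sentence; the depth bound is only there to make any single run finish.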

There are obviously important parts of English, like pronouns, adjectives and adverbs, which are missing, but we can still parse sentences like “Masaru saw a man in the apartment with a telescope” with only this fragment of English grammar. One possible parse tree is given in the figure. There are many more parse trees for this sentence though. You might want to see how many you can find.
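One way to count the parse trees, rather than hunting for them by hand, is a CYK-style dynamic program over spans of the token stream. The sketch below is my own illustration: it takes the part-of-speech tags as input, standing in for the lexical analyser's output, and counts how many trees each nonterminal admits for each span.

```python
# Tomita's grammar, split into binary rules (A -> B C) and the one
# unary rule (noun_phrase -> NOUN). Tags act as terminal symbols.
BINARY = [
    ("sentence", "noun_phrase", "verb_phrase"),
    ("sentence", "sentence", "prep_phrase"),
    ("noun_phrase", "DETERMINER", "NOUN"),
    ("noun_phrase", "noun_phrase", "prep_phrase"),
    ("prep_phrase", "PREPOSITION", "noun_phrase"),
    ("verb_phrase", "VERB", "noun_phrase"),
]
UNARY = [("noun_phrase", "NOUN")]

def count_parses(tags, start="sentence"):
    """Count parse trees for a tag sequence, CYK-style."""
    n = len(tags)
    chart = {}  # (i, j) -> {symbol: number of trees for tags[i:j]}
    for i, tag in enumerate(tags):
        cell = {tag: 1}  # a tag derives itself
        for lhs, rhs in UNARY:
            if rhs == tag:
                cell[lhs] = cell.get(lhs, 0) + 1
        chart[(i, i + 1)] = cell
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            cell = {}
            for k in range(i + 1, j):  # try every split point
                left, right = chart[(i, k)], chart[(k, j)]
                for lhs, b, c in BINARY:
                    if b in left and c in right:
                        cell[lhs] = cell.get(lhs, 0) + left[b] * right[c]
            chart[(i, j)] = cell
    return chart[(0, n)].get(start, 0)
```

Feeding in the tag sequence for “Masaru saw a man in the apartment with a telescope” shows how quickly the ambiguity grows: each additional prepositional phrase multiplies the number of attachment sites.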

Full parse tree for Tomita’s example