Closure properties

We can construct complicated languages from simpler languages in a variety of ways. It’s useful to be able to construct a grammar for the more complicated language from grammars for the simpler languages from which it’s built.

Unions

Languages are sets of lists. As sets it makes sense to talk about unions, intersections and relative complements. The union of two languages is again a language, as is the intersection or relative complement. Given left regular grammars for a pair of languages can we give a left regular grammar for their union? For the intersection? For the relative complement? The answer in each case is yes, but this is only easy to do for the union. We just need to create a new rule for the start symbol, which includes all the alternates for the start symbols of the original two languages, and copy all the rule for the other symbols, changing names if necessary to avoid duplicates. So the union of the two languages above has the grammar

%start union

%%

union   : xxyy | zero | pos_int | neg_int
        ;

xxyy    : | xxyy y | xx x 
        ;

xx      : | xx x | error y
        ;

error   : error x | error y
        ;

zero    : 0
        ;

neg_int : - pos_int
        ;

pos_int : 1 digits | 2 digits | 3 digits
        | 4 digits | 5 digits | 6 digits
        | 7 digits | 8 digits | 9 digits
        ;

digits  : | 0 digits
        | 1 digits | 2 digits | 3 digits
        | 4 digits | 5 digits | 6 digits
        | 7 digits | 8 digits | 9 digits
        ;

Of course the same remarks apply to right regular grammars as well. The union of two languages with right regular grammars has a right regular grammar. Also, the construction above is easily adapted to the union of finitely many languages.

Concatenation

We can also define a language whose members are the concatenation of a member of the first language and a member of the second. As with the union, we can construct a grammar for this new language from grammars for the two old languages by a purely mechanical procedure, though this time it’s rather more complicated. It may be easiest to understand the procedure through examples. Consider, then, the language consisting of integers followed by some number of x’s and then some number of y’s.

We can start from our right regular grammar for the integers. Instead of an empty string at the end the of the input we should now have a string in the xy language, so we replace the empty alternatives in the right regular grammar for integers with the start symbol in the right regular grammar for the xy language.

%start intxxyy

%%

intxxyy : zero | pos_int | neg_int
        ;

zero    : 0
        ;

neg_int : - pos_int
        ;

pos_int : 1 digits | 2 digits | 3 digits
        | 4 digits | 5 digits | 6 digits
        | 7 digits | 8 digits | 9 digits
        ;

digits  : weird | 0 digits
        | 1 digits | 2 digits | 3 digits
        | 4 digits | 5 digits | 6 digits
        | 7 digits | 8 digits | 9 digits
        ;

In this case there was only empty alternative, in the rule for the digits symbol. I haven’t added in the rules from the xxyy grammar yet because we have a problem. weird is a non-terminal symbol and digits is also a non-terminal symbol, but the alternatives in a rule for a non-terminal symbol in a right regular grammar should be empty or a terminal followed by a non-terminal. As a first step to fixing this we need to replace weird by its possible expansions.

digits  : xxyy | 0 digits
        | 1 digits | 2 digits | 3 digits
        | 4 digits | 5 digits | 6 digits
        | 7 digits | 8 digits | 9 digits
        ;

There was in fact only one alternate, namely xxyy. This also isn’t suitable as an alternate in a rule for a non-terminal symbol so we need to replace it by it’s possible expansions.

digits  : | x xxyy | y yy | 0 digits
        | 1 digits | 2 digits | 3 digits
        | 4 digits | 5 digits | 6 digits
        | 7 digits | 8 digits | 9 digits
        ;

Now we have a rule of the required form. We need to include more rules from the right regular grammar for the xy language so we can expand the symbols xxyy and yy. The complete grammar for the concatenation language is

%start intxxyy

%%

intxxyy : zero | pos_int | neg_int
        ;

zero    : 0
        ;

neg_int : - pos_int
        ;

pos_int : 1 digits | 2 digits | 3 digits
        | 4 digits | 5 digits | 6 digits
        | 7 digits | 8 digits | 9 digits
        ;

digits  : | x xxyy | y yy | 0 digits
        | 1 digits | 2 digits | 3 digits
        | 4 digits | 5 digits | 6 digits
        | 7 digits | 8 digits | 9 digits
        ;

xxyy    : | x xxyy | y yy
        ;

yy      : | y yy | x error
        ;

error   : x error | y error
        ;

We can also construct a left regular grammar for this concatenation language from the left regular grammars for the integer language and the xy language. This time we start from the left regular grammar for the xy language and replace the empty alternates with the start symbol for the integer language.

%start intxxyy

%%

intxxyy : xxyy
        ;

xxyy    : integer | xxyy y | xx x 
        ;

xx      : integer | xx x | error y
        ;

error   : error x | error y
        ;

This grammar is not of the required form though so we need to replace integer by its possible expansions. Those rules will still not be be of required form, so we replace those replacements. The new rules for xxyy and xx are then

xxyy    : zero | nonzero | xxyy y | xx x 
        ;

xx      : zero | nonzero | xx x | error y
        ;

Those rules will still not be be of required form, so we replace those replacements. The new rules for xxyy and xx are then

xxyy    : empty 0 |
        | nonzero 0 | nonzero 1 | nonzero 2 | nonzero 3 | nonzero 4
        | nonzero 5 | nonzero 6 | nonzero 7 | nonzero 8 | nonzero 9
        | head 1 | head 2 | head 3 | head 4 | head 5
        | head 6 | head 7 | head 8 | head 9
        | xxyy y | xx x 
        ;

xx      : empty 0 |
        | nonzero 0 | nonzero 1 | nonzero 2 | nonzero 3 | nonzero 4
        | nonzero 5 | nonzero 6 | nonzero 7 | nonzero 8 | nonzero 9
        | head 1 | head 2 | head 3 | head 4 | head 5
        | head 6 | head 7 | head 8 | head 9
        | xx x | error y
        ;

The full grammar is then

%start intxxyy

%%

intxxyy : xxyy
        ;

xxyy    : empty 0 |
        | nonzero 0 | nonzero 1 | nonzero 2 | nonzero 3 | nonzero 4
        | nonzero 5 | nonzero 6 | nonzero 7 | nonzero 8 | nonzero 9
        | head 1 | head 2 | head 3 | head 4 | head 5
        | head 6 | head 7 | head 8 | head 9
        | xxyy y | xx x 
        ;

xx      : empty 0 |
        | nonzero 0 | nonzero 1 | nonzero 2 | nonzero 3 | nonzero 4
        | nonzero 5 | nonzero 6 | nonzero 7 | nonzero 8 | nonzero 9
        | head 1 | head 2 | head 3 | head 4 | head 5
        | head 6 | head 7 | head 8 | head 9
        | xx x | error y
        ;

error   : error x | error y
        ;

head    : | empty -
        ;

empty   :
        ;

The general procedure constructing a grammar for the concatenation language from the grammars for a pair of languages is as follows.

Rename symbols to avoid name conflicts between the two grammars. Optionally rename other symbols for clarity.
If we’re constructing a left regular grammar start from a left regular grammar for the right element of the pair of languages. If we’re constructing a right regular grammar start from a right regular grammar for the left element of the pair of languages.
Replace all empty alternates with all alternates for all alternates of the start symbol in the other language.
Add all rules from the other language other than the one for its start symbol.

As a further example I’ll construct a right regular grammar for the concatenation of the xy language with itself. This time we’ll need the renaming step mentioned above. We take two copies of the the right regular grammar for the xy language.

%start weirdl

%%

weirdl : xxyyl
       ;

xxyyl  : | x xxyyl | y yyl
       ;

yyl    : | y yyl | x errorl
       ;

errorl : x errorl | y errorl
       ;

and

%start weirdr

%%

weirdr : xxyyr
       ;

xxyyr  : | x xxyyr | y yyr
       ;

yyr    : | y yyr | x errorr
       ;

errorr : x errorr | y errorr
       ;

We’re constructing a right regular grammar so we start from the grammar for the left language.

%start weirdl

%%

weirdl : xxyyl
       ;

xxyyl  : | x xxyyl | y yyl
       ;

yyl    : | y yyl | x errorl
       ;

errorl : x errorl | y errorl
       ;

We then replace both empty alternates with weirdr and then replace that with xxyyr and then that with | x xxyyr | y yyr.

%start weirdl

%%

weirdl : xxyyl
       ;

xxyyl  : | x xxyyr | y yyr | x xxyyl | y yyl
       ;

yyl    : | x xxyyr | y yyr | y yyl | x errorl
       ;

errorl : x errorl | y errorl
       ;

Then we add in rules from the other language.

%start weirdl

%%

weirdl : xxyyl
       ;

xxyyl  : | x xxyyr | y yyr | x xxyyl | y yyl
       ;

yyl    : | x xxyyr | y yyr | y yyl | x errorl
       ;

errorl : x errorl | y errorl
       ;

xxyyr  : | x xxyyr | y yyr
       ;

yyr    : | y yyr | x errorr
       ;

errorr : x errorr | y errorr
       ;

It’s important to understand which language we’ve just constructed a grammar for. A string is in this language if and only if it is the concatenation of two strings in the xy language. Those two strings could be the same but they don’t have to be. The question of whether we can construct a grammar for the language of two repetitions of the same string is one we’ll return to later.

Once we know how to construct a grammar for the concatenation of two languages we can construct a grammar for the concatenation of finitely many, by considering it at as a repeated concatenation.

Kleene star

Given a left or right regular grammar we can, using the techniques of the preceding section, construct, for each positive number \({ n }\), a grammar for the language whose members are concatenations of \({ n }\) members of the original language. Of course we can can also do this for \({ n = 0 }\). In this case the language consists of only the empty list and it has the grammar

%start start

%%

start : empty
      ;

empty :
      ;

We can also construct a grammar for the language of concatenations of between \({ m }\) and \({ n }\) members of a language, for natural numbers \({ m }\) and \({ n }\), using the union construction earlier. What’s less obvious is that we can construct a grammar for the language of concatenations of arbitrarily many members of a language.

Given a language the Kleene star of the language is set of all lists which can be found by concatenating arbitrarily many members of the language. This includes the empty list. The Kleene plus of the language is the set of lists which can be obtained by concatenating an arbitrary positive number of members of the language. If the original language had the empty list as a member then its Kleene star and Kleene plus are the same. If not then the Kleene star has the empty set as a member while the Kleene plus does not, but otherwise they are the same.

Given a regular grammar for a language we can find a regular grammar for its Kleene plus by looking through its rules for occurrences of the empty list and then adding alternates to any such rules. The alternates to be added are the alternates of alternates of the start symbol. To get a grammar for the the Kleene star we can apply our union construction discussed earlier to the Kleene plus grammar and the grammar given above for the language with just the empty list.

Applying the construction above to the xy language gives the following right regular grammar for its Kleene plus

%start weird

%%

weird : xxyy
      ;

xxyy  : | x xxyy | y yy
      ;

yy    : | y yy | x error | x xxyy | y yy
      ;

error : x error | y error
      ;

and the following grammar for its Kleene star

%start weird

%%

weird : xxyy | empty
      ;

xxyy  : | x xxyy | y yy
      ;

yy    : | y yy | x error | x xxyy | y yy
      ;

error : x error | y error
      ;

empty :
      ;

The constructions above are meant to show that regular grammars can be constructed for these languages. They do not attempt to find efficient grammars for them. For example, the Kleene star of the xy language is just the set of all lists of x’s and y’s. A much simpler grammar for this language is

%start ksxy

%%

ksxy : xyxy
     ;

xyxy : | x xyxy | y xyxy
     ;

Reversal

The reversal of a language is simply the set of the reversals of its members, where the reversal of a list is the list with the same items in reverse order. Given a left regular grammar for a language we can easily construct a right regular grammar for its reversal and vice versa. We just take the alternates in the rules for the grammar and reverse the order of the symbols. Here, for example, is a right regular grammar for the reversed integers, constructed from the left regular grammar for the integers.

integer : zero | nonzero
        ;

zero    : 0 empty
        ;

nonzero : 0 nonzero | 1 nonzero | 2 nonzero | 3 nonzero | 4 nonzero
        | 5 nonzero | 6 nonzero | 7 nonzero | 8 nonzero | 9 nonzero
        | 1 head | 2 head | 3 head | 4 head | 5 head
        | 6 head | 7 head | 8 head | 9 head
        ;

head    : | - empty
        ;

empty   :
        ;

What’s less clear is how to construct a left regular grammar for the reversal from a left regular grammar for the original language or a right regular grammar for the reversal from a regular regular grammar for the original. This is a question we’ll return to later.