Regular expressions

Regular expressions are both an important theoretical concept in computing and an important practical tool in programming. These two meanings for regular expressions are not quite the same though. For theoretical purposes it’s convenient to have a minimally expressive syntax for regular expressions. The fewer ways to construct a regular expression the easier it is to prove their properties. For practical programming it’s convenient to have a maximally expressive syntax, to make it easier to write simple regular expressions for simple tasks. To complicate matters even further, most regular expression libraries provide not just syntactic sugar to make writing regular expressions easier but also extensions which increase their power as a computational tool. That may sound good if you’re a programmer but it means that some of the statements I’ll make below about what regular expressions can and can’t do are simply untrue when applied to regular expressions as understood by those libraries. Since this module is concerned with the theory of computation rather than practical programming I will give the minimalist version but I will briefly mention the IEEE standard regular expressions understood by most libraries.

The basic operations

A language is called regular if it can be built from finite languages by the operations of union, concatenation and Kleene star. More precisely, suppose \({ A }\) is a finite set of tokens. Let \({ F }\) be the set of finite languages with tokens in \({ A }\), i.e. the set of finite sets of lists of items in \({ A }\). Let \({ S }\) be the set of all sets \({ T }\) of languages with tokens in \({ A }\) such that \({ F ⊆ T }\) and \({ T }\) is closed under union, concatenation and Kleene star, i.e. such that if \({ L }\) and \({ M }\) belong to \({ T }\) then so do \({ L ∪ M }\), \({ L ∘ M }\) and \({ L^* }\).

Then the set of regular languages for the set of tokens \({ A }\) is \({ ∩ S }\), the intersection of all the members of \({ S }\).

The notation \({ ∘ }\) for concatenation is unfortunate since it suggests composition but is in fact unrelated to it.

Intuitively, the regular languages are precisely those languages which can be built from finite languages in a finite number of steps using union, concatenation and Kleene star.

Regular expressions are a notation for describing how a language is built from a set of finite languages using those components. There are a few different notations for regular expressions. I’ll use one which is a subset of the IEEE notation. This has the limitation that it only works when the tokens are individual characters, but that’s the case of most practical interest.

In IEEE regular expressions characters represent themselves, except that a few characters are special. To represent a special character we need to place a backslash \ before it. The special characters include \ itself, the parentheses ( and ), used for grouping, the vertical bar |, used for the union operation, and the asterisk *, used for the Kleene star operation. There is no special character for concatenation; it is indicated simply by writing one expression immediately after another.
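
As a concrete illustration, here is a short Python sketch using the standard re module. Python’s regular expression syntax differs from the IEEE standard in various details, but it agrees with it on these basic constructs, including the use of backslashes to escape special characters; the patterns and test strings below are just made-up examples.

import re

# Unescaped | means union and * means Kleene star.
print(re.fullmatch("a|b", "b") is not None)             # True
print(re.fullmatch("a*", "aaa") is not None)            # True

# Special characters must be escaped with a backslash to stand for themselves.
print(re.fullmatch(r"\*", "*") is not None)             # True: a literal asterisk
print(re.fullmatch(r"\(a\|b\)", "(a|b)") is not None)   # True: literal parentheses and bar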

There is exactly one regular language which cannot be expressed in this notation: the empty language. The IEEE standard has no way of denoting this language but I’ll use \({ ∅ }\) for it.

Examples

It is probably easier to understand this via examples. x* is a regular expression for the language consisting of arbitrarily many copies of the character x. Similarly y* is a regular expression for arbitrarily many y’s. Then x*y* is a regular expression for arbitrarily many x’s followed by arbitrarily many y’s. In other words, x*y* is a regular expression for the xy language considered earlier.

Arbitrarily many includes zero, so the empty string is an element of this language. If we wanted to ensure that there is at least one x and at least one y we would have to use the regular expression xx*yy*, i.e. a single x, followed by arbitrarily many x’s, followed by a single y, followed by arbitrarily many y’s. This is, of course, a different language from the one in the preceding paragraph.
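
A quick way to see the difference between these two languages is to test membership with Python’s re module; re.fullmatch checks whether the whole string matches, and the test strings are just examples.

import re

print(re.fullmatch("x*y*", "") is not None)       # True: the empty string is allowed
print(re.fullmatch("x*y*", "xxyyy") is not None)  # True
print(re.fullmatch("x*y*", "yx") is not None)     # False: y's may not precede x's

print(re.fullmatch("xx*yy*", "") is not None)     # False: at least one x and one y required
print(re.fullmatch("xx*yy*", "xy") is not None)   # True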

As our next example let’s try to build a regular expression for the language of integers. It will be easiest to do this in stages. 0|1|2|3|4|5|6|7|8|9 is a regular expression for the language of single digits. We don’t need parentheses for grouping here because union is an associative operation. To get strings of digits we apply the Kleene star: (0|1|2|3|4|5|6|7|8|9)*. The parentheses are needed for precedence, specifically to express the fact that this means arbitrarily many digits rather than either a non-9 digit or arbitrarily many 9’s, which would be 0|1|2|3|4|5|6|7|8|(9*). The language described by the regular expression (0|1|2|3|4|5|6|7|8|9)* includes the empty string and also 007. The former is not an integer and the latter is an integer but doesn’t satisfy the normalisation conditions we imposed earlier to make sure each integer has a unique representation. Both of these problems can be fixed by making sure there’s a non-zero digit before the (0|1|2|3|4|5|6|7|8|9)*. A regular expression for non-zero digits is 1|2|3|4|5|6|7|8|9 so we’re led to

(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*.

This is a regular expression for the language of positive integers. To allow negative integers as well we concatenate the regular expression (|-) in front of this one to get

(|-)(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*.

The (|-) matches either the empty string or a single -. We now have a regular expression which matches all nonzero integers. A regular expression which matches zero is just 0. The following regular expression therefore matches all integers:

(0|(|-)(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*).
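
As a sanity check, here is how one might test this expression with Python’s re module, which also accepts the empty alternative in (|-); the sample strings are arbitrary.

import re

integer = "(0|(|-)(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*)"

for s in ["0", "7", "-15", "2024", "007", "-0", ""]:
    print(s, re.fullmatch(integer, s) is not None)

# "0", "7", "-15" and "2024" match; "007", "-0" and the empty string do not.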

Towards the end of the example I started using the standard term “match” for the situation where a string is a member of the language described by a regular expression. It’s quite convenient and I’ll use it without comment from now on.

From regular expressions to grammars

Converting a regular expression to a left or right regular grammar for the language it describes might seem difficult, and doing it efficiently is indeed difficult, but as long as we just want some regular grammar and don’t care about efficiency it’s easy.

First of all, it’s easy to write down a grammar for a single character. For example, here is a grammar for the language whose only string is a single x:

start : one_x

one_x : "x" empty
empty :

This is a right regular grammar. A left regular grammar for the same language would look the same, except the expansion for one_x would be empty "x".

We’ve already seen how to construct regular grammars for the union, concatenation or Kleene star of languages from regular grammars for those languages. Regular expressions describe languages built up from single character languages, and the language whose only member is the empty string, by those three operations, so in principle we know how to construct either a left or a right regular grammar for the language described by any regular expression, as in the example below. This procedure will generally give us an unnecessarily complicated grammar for the language, but it will give us a grammar. It follows that every regular language can be described by a left regular grammar and by a right regular grammar.
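
For example, here is one possible right regular grammar for the language described by x*y*, written in the same notation as above. It isn’t the grammar the general constructions would produce, which would typically be more complicated, and I’m assuming a rule with several expansions can be written as several lines with the same name on the left.

start :
start : "x" start
start : "y" more_y
more_y :
more_y : "y" more_y

Every expansion is either empty or a token followed by a single nonterminal, so this is a right regular grammar, and the strings it generates are exactly those consisting of arbitrarily many x’s followed by arbitrarily many y’s.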

Regular expressions from automata

Any language which can be recognised by a finite state automaton can be described by a regular expression. This is proved by induction on the number of states. To make the induction work though we first need to generalise our notion of finite state automata. The automata we’ve considered read single tokens and make a state transition based on the token read. For each pair of states \({ r }\) and \({ s }\) there is a set, possibly empty, of tokens which allow a transition from \({ r }\) to \({ s }\). We can represent this set of tokens by a regular expression, consisting of those tokens with |’s between them.

A generalised finite state automaton will be one where for each pair of states \({ r }\) and \({ s }\) there is a regular expression such that if the automaton reads a list of tokens matching that regular expression while in the state \({ r }\) then it can transition to the state \({ s }\). Every finite state automaton is a generalised finite state automaton, where the regular expression is just the one described previously, i.e. the list of tokens separated by |’s. We can add two states to this automaton to create a new one which has only a single initial state and a single accepting state. To do this we demote the previous initial states and accepting states to ordinary states and add a new initial state and a new accepting state. We then allow transitions from the new initial state to the old ones where the regular expression is the one matching the empty list and transitions from the old accepting states to the new one, again with the regular expression being the one for the empty list. The new automaton can therefore go from the new initial state to one of the old ones without reading any input, then proceed as before, arrive at one of the old accepting states at the end of its input, and then transition to the new accepting state without reading any further input. The states other than the initial and accepting states will be called intermediate states.

Suppose we have a generalised finite state automaton with a single initial state and a single accepting state and at least one intermediate state. We can construct another generalised finite state automaton which recognises the same language, also with a single initial state and a single accepting state, but with one intermediate state fewer, as follows.

We pick an intermediate state \({ r }\). We want to remove \({ r }\) but some computational paths go through \({ r }\) so we will need to replace them with paths which don’t. For each pair of other states \({ q }\) and \({ s }\) there are possibly paths which go from \({ q }\) to \({ r }\) and then from \({ r }\) to \({ s }\). We can replicate the effect of those paths by unioning the existing regular expression for transitions from \({ q }\) to \({ s }\) with the concatenation of the regular expression for transitions from \({ q }\) to \({ r }\) with the one for transitions from \({ r }\) to \({ s }\). This isn’t quite enough though, since a computational path might stay at \({ r }\) for an arbitrary number of steps before moving on to \({ s }\). So what we need to add to the regular expression for transitions from \({ q }\) to \({ s }\) is a concatenation of three regular expressions: the one for transitions from \({ q }\) to \({ r }\), the Kleene star of the one for transitions from \({ r }\) to itself, and then the one for transitions from \({ r }\) to \({ s }\). Once we’ve done this for all pairs \({ q }\) and \({ s }\) the state \({ r }\) is no longer needed and can be removed.

Removing intermediate states one after another we eventually reach the point where there are no intermediate states. We’re left with just initial and accepting states. If we did the construction as described above then there is one of each and there are no allowed transitions from the initial state to itself or from the accepting state to itself or to the initial state. The only allowed transition is one directly from the initial state to the accepting state. Input will be accepted by this machine if and only if it matches the regular expression for that transition. In this way we’ve found a regular expression which matches precisely those inputs recognised by the original finite state automaton.
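
The elimination step is easy to express in code. Here is a minimal Python sketch, under the assumption that the generalised automaton is given as a dictionary mapping pairs of states to regular expression strings, with missing entries meaning no allowed transition; the state names and helper functions are invented for the illustration.

def union(a, b):
    # Union of two possibly absent regular expressions.
    if a is None:
        return b
    if b is None:
        return a
    return "(" + a + "|" + b + ")"

def concat(*parts):
    # Concatenation; if any part is absent there is no such path at all.
    if any(p is None for p in parts):
        return None
    return "".join(parts)

def star(a):
    # Kleene star; the star of no transitions matches only the empty string.
    return "" if a is None else "(" + a + ")*"

def eliminate(states, table, initial, accepting):
    # Remove the intermediate states one by one, rerouting paths through each
    # removed state r via the expression (q to r)(r to r)*(r to s), and return
    # the regular expression for the single remaining transition.
    for r in [q for q in states if q not in (initial, accepting)]:
        loop = star(table.get((r, r)))
        others = [q for q in states if q != r]
        for q in others:
            for s in others:
                detour = concat(table.get((q, r)), loop, table.get((r, s)))
                table[(q, s)] = union(table.get((q, s)), detour)
        states = others
    return table.get((initial, accepting))

# A tiny automaton with one intermediate state looping on x; it recognises x*.
print(eliminate(["i", "m", "a"], {("i", "m"): "", ("m", "m"): "x", ("m", "a"): ""}, "i", "a"))

Running this prints (x)*, a regular expression for the language recognised by the little automaton.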

I’m not going to give an example of the construction above. The regular expressions it produces are horribly inefficient. The point of the construction is just to prove that every language recognised by a finite state automaton is a regular language.

Reversal

One nice property of regular expressions is that given a regular expression for a language it’s very easy to construct a regular expression for the reversed language. Unions and Kleene stars can be left unchanged and we just need to reverse the order of the concatenations. For example, a regular expression for the reversed integers is

(0|(0|1|2|3|4|5|6|7|8|9)*(1|2|3|4|5|6|7|8|9)(|-)).
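
Checking this in Python as before, using the reverses of a few sample strings:

import re

reversed_integer = "(0|(0|1|2|3|4|5|6|7|8|9)*(1|2|3|4|5|6|7|8|9)(|-))"

for s in ["0", "51-", "001", "700", ""]:
    print(s, re.fullmatch(reversed_integer, s) is not None)

# "0", "51-" (the reverse of -15) and "001" (the reverse of 100) match, while
# "700" (the reverse of 007) and the empty string do not, just as 007 and the
# empty string are not matched by the original regular expression.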

Extended syntax

For practical purposes it’s useful to have more operations available than just union, concatenation and Kleene star. We didn’t include those extra operations in the definition to avoid needing to prove the corresponding closure properties of regular grammars.

In the IEEE standard + is used for the Kleene plus, i.e. one or more copies rather than the zero or more of the Kleene star, and ? is used for zero or one copies. Explicit repetition ranges are also allowed, written as numbers in braces. For example (0|1|2|3|4|5|6|7|8|9){3,5} would indicate a string of at least three and at most five digits. Character ranges are also allowed, indicated by brackets, so three to five digits could also be written as [0-9]{3,5}. Of course this requires +, ?, braces and brackets to be special characters, which then have to be preceded by backslashes in order to represent themselves. There are a few other similar extensions. If you’re using regular expressions for pattern matching, and you really should if you have to do pattern matching, then you should consult the documentation for whatever library you’re using, both to see what extensions are available and to see which characters are special. Even if you never use an extension you may need to know about it if it makes certain characters special and therefore requires you to precede them with backslashes.
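
All of these abbreviations can be rewritten using only the three basic operations. The following Python snippet checks a couple of such equivalences on a handful of made-up test strings.

import re

tests = ["", "4", "12", "123", "12345", "123456"]

# [0-9]+ is an abbreviation for [0-9][0-9]*.
print([s for s in tests if re.fullmatch("[0-9]+", s)])
print([s for s in tests if re.fullmatch("[0-9][0-9]*", s)])

# [0-9]{3,5} abbreviates three digits followed by two optional digits.
print([s for s in tests if re.fullmatch("[0-9]{3,5}", s)])
print([s for s in tests if re.fullmatch("[0-9][0-9][0-9](|[0-9])(|[0-9])", s)])

The first two lines of output agree, as do the last two.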

The extensions above are a matter of syntactic convenience. They don’t change the set of languages which can be represented; they just make it possible to represent some languages with shorter regular expressions. There are other extensions in many implementations which change the set of representable languages. The new languages which these allow cannot be generated by regular grammars. None of what I say in this chapter about regular languages can be assumed to apply to the languages described using these extensions.
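
One common example of such an extension is the backreference, which lets a pattern demand that a later part of the string repeat exactly whatever an earlier parenthesised group matched. The Python snippet below, which is just an illustration, uses a backreference to match the strings consisting of a run of a’s, a b, and then a second run of a’s of the same length; this language cannot be generated by any regular grammar.

import re

pattern = r"(a*)b\1"   # \1 must repeat exactly what the group (a*) matched

for s in ["b", "aba", "aabaa", "aabaaa"]:
    print(s, re.fullmatch(pattern, s) is not None)

# "b", "aba" and "aabaa" match; "aabaaa" does not, because its runs of a's differ in length.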