Regular expressions are both an important theoretical concept in computing and an important practical tool in programming. These two meanings for regular expressions are not quite the same though. For theoretical purposes it’s convenient to have a minimally expressive syntax for regular expressions. The fewer ways to construct a regular expression the easier it is to prove their properties. For practical programming it’s convenient to have a maximally expressive syntax, to make it easier to write simple regular expressions for simple tasks. To complicate matters even further, most regular expression libraries provide not just syntactic sugar to make writing regular expressions easier but also extensions which increase their power as a computational tool. That may sound good if you’re a programmer but it means that some of the statements I’ll make below about what regular expressions can and can’t do are simply untrue when applied to regular expressions as understood by those libraries. Since this module is concerned with the theory of computation rather than practical programming I will give the minimalist version but I will briefly mention the IEEE standard regular expressions understood by most libraries.
A language is called regular if it can be built from finite languages by the operations of union, concatenation and Kleene star. More precisely, suppose \({ A }\) is a finite set of tokens. Let \({ F }\) be the set of finite languages with tokens in \({ A }\), i.e. the set of finite sets of lists of items in \({ A }\). Let \({ S }\) be the set of all sets of languages defined by the following condition: a set \({ T }\) of languages belongs to \({ S }\) if and only if \({ F ⊆ T }\) and \({ T }\) is closed under union, concatenation and Kleene star.
Then the set of regular languages for the set of tokens \({ A }\) is \({ ∩ S }\).
The notation \({ ∘ }\) for concatenation is unfortunate since it suggests composition but is in fact unrelated to it.
Intuitively, the regular languages are exactly those languages which can be built from finite languages in a finite number of steps, each step being a union, a concatenation or a Kleene star.
Regular expressions are a notation for describing how a language is built from a set of finite languages using those components. There are a few different notations for regular expressions. I’ll use one which is a subset of the IEEE notation. This has the limitation that it only works when the tokens are individual characters, but that’s the case of most practical interest.
In IEEE regular expressions characters represent themselves, except that a few characters are special. To represent a special character we need to place a backslash \ before it. The special characters include \ itself, the parentheses ( and ), used for grouping, the vertical bar |, used for the union operation, and the asterisk *, used for the Kleene star operation. There is no special character for concatenation; concatenation is indicated simply by writing one expression after another.
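As a concrete illustration, here is the effect of escaping in Python’s re module, one of the libraries alluded to above. Its syntax is a superset of the IEEE notation, but it treats the characters used here in the same way, so it is safe to use for a quick experiment:
import re

# * is special: a*b means "arbitrarily many a's followed by a b"
print(re.fullmatch(r"a*b", "aaab") is not None)   # True
# \* is a literal asterisk, so a\*b matches only the three-character string a*b
print(re.fullmatch(r"a\*b", "a*b") is not None)   # True
print(re.fullmatch(r"a\*b", "aaab") is not None)  # False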
There is exactly one regular language which cannot be expressed in this notation: the empty language. The IEEE standard has no way of denoting this language but I’ll use \({ ∅ }\) for it.
It is probably easier to understand this via examples.
x* is a regular expression for the language consisting of arbitrarily many copies of the character x. Similarly y* is a regular expression for arbitrarily many y’s. Then x*y* is a regular expression for arbitrarily many x’s followed by arbitrarily many y’s. In other words, x*y* is a regular expression for the xy language considered earlier.
Arbitrarily many includes zero, so the empty string is an element of this language. If we wanted to ensure that there is at least one x and at least one y we would have to use the regular expression xx*yy*, i.e. a single x, followed by arbitrarily many x’s, followed by a single y, followed by arbitrarily many y’s. This is, of course, a different language from the one in the preceding paragraph.
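If you want to experiment with these examples, Python’s re module can be used to check them; its syntax agrees with the notation above on everything used here, and re.fullmatch tests whether an entire string matches:
import re

# x*y* : arbitrarily many x's followed by arbitrarily many y's
for s in ["", "xxy", "xyyy", "yx"]:
    print(repr(s), re.fullmatch(r"x*y*", s) is not None)
# the empty string, "xxy" and "xyyy" match; "yx" does not

# xx*yy* : at least one x followed by at least one y
print(re.fullmatch(r"xx*yy*", "") is not None)    # False: the empty string no longer matches
print(re.fullmatch(r"xx*yy*", "xy") is not None)  # True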
As our next example let’s try to build a regular expression for the language of integers. It will be easiest to do this in stages.
0|1|2|3|4|5|6|7|8|9 is a regular expression for the language of single digits. We don’t need parentheses for grouping here because union is an associative operation. To get strings of digits we apply Kleene star: (0|1|2|3|4|5|6|7|8|9)*. The parentheses are needed for precedence, specifically to express the fact that this is arbitrarily many digits rather than either a single non-9 digit or arbitrarily many 9’s, which would be 0|1|2|3|4|5|6|7|8|(9*). The language described by the regular expression (0|1|2|3|4|5|6|7|8|9)* includes the empty string and also 007. The former is not an integer and the latter is an integer but doesn’t satisfy the normalisation conditions we imposed earlier to make sure each integer has a unique representation. Both of these problems can be fixed by making sure there’s a non-zero digit before the (0|1|2|3|4|5|6|7|8|9)*.
A regular expression for non-zero digits is 1|2|3|4|5|6|7|8|9 so we’re led to (1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*. This is a regular expression for the language of positive integers. To allow negative integers we concatenate the regular expression |- with this regular expression to get (|-)(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*. A |- matches either the empty string or a single -. We now have a regular expression which matches all nonzero integers. A regular expression which matches zero is just 0. The following regular expression therefore matches all integers: (0|(|-)(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*).
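This too can be checked with Python’s re module, which happens to accept the empty alternative in (|-); not every implementation does:
import re

integer = r"(0|(|-)(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*)"
for s in ["0", "7", "-42", "007", "-0", ""]:
    print(repr(s), re.fullmatch(integer, s) is not None)
# "0", "7" and "-42" match; "007", "-0" and the empty string do not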
Towards the end of the example I started using the standard term “match” for the situation where a string is a member of the language described by a regular expression. It’s quite convenient and I’ll use it without comment from now on.
Converting a regular expression to a left or right regular grammar for the language it describes might seem difficult, and doing it efficiently is indeed difficult, but as long as we just want some regular grammar and don’t care about efficiency it’s easy.
First of all, it’s easy to write down a grammar for a single character. For example, here is a grammar for the language whose only string is a single x:
start : one_x
one_x : "x" empty
empty :
This is a right regular grammar. A left regular grammar for the same language would look the same, except the expansion for one_x would be empty x.
We’ve already seen how to construct regular grammars for the union, concatenation or Kleene star of languages from regular grammars for the original languages. Regular expressions are built up from expressions for a single character or for the empty string using those three operations, so in principle we know how to construct either a left or a right regular grammar for the language described by any regular expression. This procedure will generally give us an unnecessarily complicated grammar for the language, but it will give us a grammar. It follows that every regular language can be described by a left regular grammar and by a right regular grammar.
Any language which can be recognised by a finite state automaton can be described by a regular expression. This is proved by induction on the number of states. To make the induction work though we first need to generalise our notion of finite state automata. The automata we’ve considered read single tokens and make a state transition based on the token read. For each pair of states \({ r }\) and \({ s }\) there is a set, possibly empty, of tokens which allow a transition from \({ r }\) to \({ s }\). We can represent this set of tokens by a regular expression, consisting of those tokens with |’s between them.
A generalised finite state automaton will be one where for each pair of states \({ r }\) and \({ s }\) there is a regular expression such that if the automaton reads a list of tokens matching that regular expression while in the state \({ r }\) then it can transition to the state \({ s }\). Every finite state automaton is a generalised finite state automaton, where the regular expression is just the one described previously, i.e. the list of tokens separated by |’s. We can add two states to this automaton to create a new one which has only a single initial state and a single accepting state. To do this we demote the previous initial states and accepting states to ordinary states and add a new initial state and a new accepting state. We then allow transitions from the new initial state to the old ones where the regular expression is the one matching the empty list and transitions from the old accepting states to the new one, again with the regular expression being the one for the empty list. The new automaton can therefore go from the new initial state to one of the old ones without reading any input, then proceed as before, arrive at one of the old accepting states at the end of its input, and then transition to the new accepting state without reading any further input. The states other than the initial and accepting states will be called intermediate states.
Suppose we have a generalised finite state automaton with a single initial state and a single accepting state and at least one intermediate state. We can construct another generalised finite state automaton which recognises the same language, also with a single initial state and a single accepting state, but with one intermediate state fewer, as follows.
We pick an intermediate state \({ r }\). We want to remove \({ r }\) but some computational paths go through \({ r }\) so we will need to replace them with paths which don’t. For each pair of other states \({ q }\) and \({ s }\) there are possibly paths which go from \({ q }\) to \({ r }\) and then from \({ r }\) to \({ s }\). We can replicate the effect of those paths by unioning the existing regular expression for transitions from \({ q }\) to \({ s }\) with the concatenation of the regular expression for transitions from \({ q }\) to \({ r }\) with the one for transitions from \({ r }\) to \({ s }\). This isn’t quite enough though, since a computational path might stay at \({ r }\) for an arbitrary number of steps before moving on to \({ s }\). So what we need to add to the regular expression for transitions from \({ q }\) to \({ s }\) is a concatenation of three regular expressions: the one for transitions from \({ q }\) to \({ r }\), the Kleene star of the one for transitions from \({ r }\) to itself, and then the one for transitions from \({ r }\) to \({ s }\). Once we’ve done this for all pairs \({ q }\) and \({ s }\) the state \({ r }\) is no longer needed and can be removed.
Removing intermediate states one after another we eventually reach the point where there are no intermediate states. We’re left with just initial and accepting states. If we did the construction as described above then there is one of each and there are no allowed transitions from the initial state to itself or from the accepting state to itself or to the initial state. The only allowed transition is one directly from the initial state to the accepting state. Input will be accepted by this machine if and only if it matches the regular expression for that transition. In this way we’ve found a regular expression which matches precisely those inputs recognised by the original finite state automaton.
I’m not going to give an example of the construction above. The regular expressions it produces are horribly inefficient. The point of the construction is just to prove that every language recognised by a finite state automaton is a regular language.
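The elimination step itself is short enough to sketch in code, though. The following Python is only a sketch, using conventions of my own rather than any standard library: transitions are stored in a dictionary mapping pairs of states to regular expression strings, a missing entry means there is no transition, and the empty string is the expression matching only the empty list.
def eliminate(trans, r):
    """Remove the intermediate state r, rerouting paths that pass through it."""
    states = {q for pair in trans for q in pair if q != r}
    loop = trans.get((r, r))                      # expression for staying at r
    star = "(" + loop + ")*" if loop is not None else ""
    new = {}
    for q in states:
        for s in states:
            parts = []
            if (q, s) in trans:                   # keep the existing direct transitions
                parts.append(trans[(q, s)])
            if (q, r) in trans and (r, s) in trans:
                # q to r, loop at r arbitrarily often, then r to s
                parts.append("(" + trans[(q, r)] + ")" + star + "(" + trans[(r, s)] + ")")
            if parts:
                new[(q, s)] = "|".join(parts)
    return new

def to_regex(trans, initial, accepting, intermediate):
    """Eliminate every intermediate state and read off the remaining expression."""
    for r in intermediate:
        trans = eliminate(trans, r)
    return trans.get((initial, accepting))        # None means the empty language

# A small automaton accepting strings of a's and b's which end in b:
trans = {
    ("i", "m"): "",        # from the initial state, enter m without reading anything
    ("m", "m"): "a|b",     # loop at m reading any single token
    ("m", "f"): "b",       # move to the accepting state on a final b
}
print(to_regex(trans, "i", "f", ["m"]))  # ()(a|b)*(b)
As promised, the expressions it produces are cluttered, but they do describe the right languages.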
One nice property of regular expressions is that given a regular expression for a language it’s very easy to construct a regular expression for the reversed language. Unions and Kleene stars can be left unchanged and we just need to reverse the order of the concatenations. For example, a regular expression for the reversed integers is
(0|(0|1|2|3|4|5|6|7|8|9)*(1|2|3|4|5|6|7|8|9)(|-)).
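A quick check with Python’s re module confirms this behaves as expected:
import re

reversed_integer = r"(0|(0|1|2|3|4|5|6|7|8|9)*(1|2|3|4|5|6|7|8|9)(|-))"
print(re.fullmatch(reversed_integer, "24-") is not None)  # True: the reversal of -42
print(re.fullmatch(reversed_integer, "0") is not None)    # True: 0 reversed is itself
print(re.fullmatch(reversed_integer, "-42") is not None)  # False: -42 itself is not reversed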
For practical purposes it’s useful to have more operations available than just union, concatenation and Kleene star. We didn’t include those extra operations in the definition to avoid needing to prove the corresponding closure properties of regular grammars.
In the IEEE standard + is used for Kleene plus, i.e. a concatenation with at least one member. ? is used for at most one member, but possibly zero. Also explicit ranges are allowed, denoted by numbers in braces. For example (0|1|2|3|4|5|6|7|8|9){3,5} would indicate a string of at least three and at most five digits. Character ranges are also allowed, indicated by brackets, so three to five digits could also be represented as [0-9]{3,5}. Of course this requires +, ?, braces and brackets to be special characters, which then have to be preceded by backslashes in order to represent themselves. There are a few other similar extensions. If you’re using regular expressions for pattern matching, and you really should if you have to do pattern matching, then you should consult the documentation for whatever library you’re using, both to see what extensions are available and to see which characters are special. Even if you will never use an extension you may need to know about it if it makes certain characters special and therefore requires you to precede them with backslashes.
The extensions above are a matter of syntactic convenience. They don’t change the set of languages which can be represented; they just make it possible to represent some languages with shorter regular expressions. There are other extensions in many implementations which change the set of representable languages. The new languages which these allow cannot be generated by regular grammars. None of what I say in this chapter about regular languages can be assumed to apply to the languages described using these extensions.