Formal definition

Let \({ L A }\) be the set of lists all of whose elements are in \({ A }\). We’ll define this notion more precisely later; for now it suffices to note that a list must be of finite length, and that length 0 is allowed. The set \({ A }\) is called the alphabet of the language and its elements are called tokens. Any subset of \({ L A }\) is called a language.
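To make the definition concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than part of the definition: the two-token alphabet, the names `ALPHABET`, `is_list_over`, `in_language` and `lists_up_to`, and the choice of "lists of even length" as the example language. A language is represented by its membership test, since the language itself is usually an infinite set.

```python
from itertools import product

# Hypothetical alphabet A with two tokens.
ALPHABET = {"a", "b"}

def is_list_over(alphabet, tokens):
    """True if every element of the finite list `tokens` lies in `alphabet`."""
    return all(t in alphabet for t in tokens)

def in_language(tokens):
    """Membership test for one example language:
    the lists over ALPHABET that have even length."""
    return is_list_over(ALPHABET, tokens) and len(tokens) % 2 == 0

def lists_up_to(alphabet, max_len):
    """Enumerate every list over `alphabet` of length 0 to max_len inclusive."""
    for n in range(max_len + 1):
        for combo in product(sorted(alphabet), repeat=n):
            yield list(combo)

# The members of the example language with length at most 2.
short_members = [w for w in lists_up_to(ALPHABET, 2) if in_language(w)]
```

The empty list is a member of this particular language, since 0 is even, while `["a"]` is a list over the alphabet that is not a member, and `["a", "c"]` is not a list over the alphabet at all. Any other membership test over lists of tokens would define a language just as well.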

This definition of language is broad enough to include a wide variety of meanings which are commonly given to the word, including

programming languages like C, Python, Rust, LISP, Haskell, etc.,
data description languages like (parts of) SQL,
file formats like .csv or .ini,
specialised single purpose languages like printcap config file entry syntax,
languages for mathematical logic like the ones we’ll use for zeroth and first order logic.

It may not always be clear which category a language belongs to. In the introduction I introduced a single purpose language for module enrollment but it turns out to be equivalent to one of the languages used for mathematical logic, namely that of the propositional calculus. Similarly you might think of PostScript as a single purpose language for page description but it is also a full programming language capable of anything any other programming language is capable of. I’ve written PostScript code to solve ordinary differential equations and to compose Lorentz transformations. This isn’t as bizarre a thing to do as it might seem. If your aim is to produce nice diagrams and you have a language which can describe diagrams in a way every modern printer can understand and which is also a full programming language then why wouldn’t you just do everything in that language? The answer to that question, as it turns out, is that debugging PostScript code is very painful.

The definition above doesn’t really include natural languages, like English, Irish, Arabic, Japanese, or Toki Pona, used by humans for communicating with other humans. For those it’s often unclear whether particular lists of tokens are valid elements of the language. Subsets of natural languages are often used for communication between humans and computers though. The subset of a natural language that a given computer programme emits is almost always a language by the definition above. The subset it accepts is always one as well. Also, many of the concepts described below were first developed in the context of natural languages, and only later was it noticed that they apply even better to languages used by computers.