INTERNATIONAL JOURNAL OF SPEECH TECHNOLOGY 7, 269–279, 2004
© 2004 Kluwer Academic Publishers. Manufactured in The Netherlands.
Speech-Based Real-Time Subtitling Services
ANDREW LAMBOURNE
SysMedia Ltd. Riverdale House, 19-21 High Street, Wheathampstead, Herts., UK
Andrew.Lambourne@sysmedia.com
JILL HEWITT, CAROLINE LYON AND SANDRA WARREN
School of Computer Science, University of Hertfordshire, Hatfield, Herts., UK
J.A.Hewitt@herts.ac.uk
Abstract. Recent advances in technology have led to the availability of powerful speech recognizers at low cost
and to the possibility of using speech interaction in a variety of new and exciting practical applications. The purpose
of this research was to investigate and develop the use of speech recognition in live television subtitling. This
paper describes how the “SpeakTitle” project met the challenges of real time speech recognition and live subtitling
through the development of a customisable speaker interface and use of ‘Topics’ for specific subject domains. In
the prototype system (described in Hewitt et al., 2000; Bateman et al., 2001) output from the speech recognition
system (the IBM ViaVoice® engine) is passed to a custom-built editor from where it can be corrected and passed
on to an existing subtitling system. The system was developed to the extent that it was acceptable for the production
of subtitles for live television broadcasts and it has been adopted by three subtitle production facilities in the UK.
The evolution of the product and the experiences of users in developing the system in a live subtitling environment
are considered, and the system is analysed against industry standards. Ease-of-use and accuracy are also discussed
and further research areas are identified.
Keywords: real-time subtitling, speech recognition, language models
1. Introduction
The value of subtitles in providing access for the
hearing-impaired audience of television programmes
has long been recognized and is reflected in legislation
in the US and Europe (see for example references: UK
legislation, 1990, 1996, 2001; US legislation, 1990,
1996, 2001). As broadcasters seek to fulfil the man-
dated increases in subtitling coverage, more and more
subtitling is being created in real-time for “live” or “as
live” television. This presents technical and editorial
challenges, as well as the problem of finding suitably
skilled subtitlers. As an alternative to different kinds of
fast keyboard devices, the technique of “re-speaking” a
subtitle commentary into a speech recognizer has been
investigated. This paper describes a project to assess
the feasibility of this method and to develop a practi-
cal system for speech-based live subtitling. Following
a general overview of initial work in the live subti-
tling field, the paper goes on to describe the SpeakTitle
project, in which speech recognition technology is used
to produce real-time subtitles for live broadcasts in the
UK (Hewitt et al., 2000; Bateman et al., 2001).
1.1. Background
Technology to enable transmission of “closed captions”
(subtitles which are only visible on the picture with
the aid of special decoder circuitry) was developed
independently in the US and the UK during
the 1970s (reference: US standards, 1991, 1992, 1999,
2000 and history; UK standards, 1975, 1976). In 1982,
the National Captioning Institute in the US started
producing real-time captions for live programmes us-
ing specially trained court reporters to input the text as
phonetic codes on special stenographic keyboards (ref-
erence: National Captioning Institute). The codes were
converted back into conventional text using transcription
software with phonetic-to-English dictionaries (for
general information, see reference: Stenograph).
The teletext standard adopted in the UK and Europe
(reference: UK Standards, 1975, 1976) provided for
closed captioning or “optional subtitling” by reserving
one or more page numbers to carry subtitle services.
The first regular UK subtitling service was launched on
Independent Television in 1979, and regular live news
subtitling started in 1987. Alternative approaches for
real-time input implemented in the UK involved using
normal QWERTY keyboards (up to around 90 wpm
sustained) or Velotype syllabic chord keyboards, on
which two or more keys may be pressed simultaneously
(up to around 120 wpm sustained), to provide an edited
version of the spoken word (for general information,
see reference: Velotype). Specialist techniques such as
the use of “shortforms” (automatically expanding predefined
abbreviation codes for the names of people and
places and hard-to-type words) were developed. Multiplexing
two slower operators in tandem, each transcribing
alternate utterances, allowed high quality “edited”
real-time subtitling to be produced.
The problems with this approach are that it is labour-
intensive (two multiplexed operators are needed for
best results using QWERTY or Velotype), there is a
shortage of trained Velotype operators and training
times are extended (one year or more), and it does not
easily suit all programme types (for example, sports
where there are many different player names). As an al-
ternative, stenographic keyboards and associated tran-
scription software were also adopted in the UK dur-
ing the 1990s. However, the use of stenography is also
problematic for the reason that operators are in short
supply and require 2 or 3 years of training and experi-
ence to achieve the necessary speed and accuracy.
1.2. Possibilities for Speech Input
Since the early days of real-time subtitling, it had been
foreseen that speech recognition could eventually pro-
vide a viable future text input modality. While afford-
able recognition technology was still in its infancy,
Damper, Lambourne and Guy had in 1985 proposed
using speech input as an adjunct to keyboard entry in
television subtitling (Damper et al., 1985). The sys-
tem used a series of simple restricted speech com-
mands to control the position and style (principally
colour) of live subtitles entered on a QWERTY key-
board, thus enabling the operator to focus maximum
effort on text entry. Early trials demonstrated the dif-
ficulty of using speech and keyboard input simultane-
ously at the same workstation due to keyboard noise
affecting recognition.
Once speech recognition technology reached the
point that an affordable system could deliver near real-
time transcription of continuous speech from a trained
speaker, it was worth seriously investigating its appli-
cation to live subtitling. Production of acceptable sub-
titles by direct recognition from the TV soundtrack was
judged to be not feasible for a number of reasons: in-
terference from background music or noise; likelihood
of multiple simultaneous speakers; the need for highly
accurate speaker-independent recognition in real time;
and the need to control style and position. The technique
of speech input by a trained “editing re-speaker”
(hereafter referred to as the Speaker) was therefore chosen.
The Speaker would be trained in the use of the
speech recognition system, would train the recognition
system to recognize his/her voice, and would develop
vocabulary appropriate to the expected subject matter.
The purpose of this project was therefore to investigate
whether and how real-time speech input could be used
to create acceptable subtitles for live TV broadcasts,
and to create the necessary interface tools to facilitate
it.
The research project was initiated in 1998 between
a broadcast software development company Synap-
sys Ltd. (now SysMedia Ltd.) and the University of
Hertfordshire, under the auspices of a DTI (Depart-
ment of Trade and Industry) LINK scheme (refer-
ence: LINK, 1998). This was a government-sponsored
scheme which partially funded broadcasting technol-
ogy initiatives.
1.3. Project Goals
This use of speech recognition technology particu-
larly focuses on the problems of real-time recognition
(i.e., the speech engine must deliver a transcript with
minimal delay) and high accuracy. This can be sup-
plemented by having a second operator to catch last-
minute errors and rapidly correct them. It was hoped
that the technique would enable operators to be se-
lected and trained far more rapidly than for the fast
keyboard technologies. If quality criteria were met, the
use of speech recognition would provide an alternative
to keyboard technologies for entering text in real-time
for live subtitling purposes, and indeed for other transcription
areas.
The tasks and goals were defined as follows:
• assess the suitability of speech input for various programme genres
• assess different speech engines and choose a suitable candidate
• devise a suitable software interface to the selected speech engine
• devise a suitable user interface for the speech subtitling system
• reach an acceptable quality threshold in the recognized text
• cope with changing vocabulary such as names of sports teams
• deliver subtitle text with an acceptably low throughput delay
During the course of the research and development
different models for the production and correction of
text were assessed. Initially it was proposed that two
people would be needed, one listening, editing if necessary
and re-speaking (the Speaker) and one correcting
any errors (the Corrector), but by the end of the work
the recognition accuracy for suitable programme ma-
terial was judged to be high enough to dispense with
the Corrector. Subtitles were typically presented in
“scrolling mode” rather than the traditional “block
mode” in order to minimise the delay between an utterance
and the appearance of the words on-screen.
The results of the work are embodied in a new product
called “SpeakTitle” which utilizes a commercial
recognition engine to produce real-time teletext subti-
tles for live programmes. It needs to be operated by a
trained Speaker, who in turn has trained the recogni-
tion engine to recognize his/her voice, and in a suitably
quiet acoustic environment. In addition, performance
is enhanced if the recognition engine is supplied with
topic-specific vocabulary files, since different topics
will not only use specific words but they may use com-
binations of words in different ways. Given these provi-
sions, SpeakTitle is being used successfully to subtitle
a variety of sporting events and other live programmes.
It has enabled television companies to widen the pool
of real-time subtitle production staff and thus increase
the potential for subtitling an increased number of live
broadcasts.
This paper describes how the project met the chal-
lenges of real-time speech recognition and live sub-
titling. It outlines how the project was adapted to
achieve television guidelines on subtitle acceptability
(see Section 1.4) as well as the philosophy of the ap-
proach adopted. The paper also describes the evolution
of the product and the experiences of users in develop-
ing the system in a live subtitling environment.
1.4. Design Criteria for Speech-Based Subtitling
In order to set up and operate a service delivering
speech-based TV subtitles that will be useful to view-
ers, general criteria for successful real-time subtitling
need to be met (reference: ITC, 1999). Such criteria are
not hard-and-fast, since it is clear that by definition the
production of subtitles in real time cannot be perfect
since text entry will always take a finite time, and there
is little opportunity to correct errors.
The general quality and operational criteria were:
(1) Accuracy: the recognition accuracy needs to be
around 97–98%. Assuming an average of 14 words
in a 2-line subtitle, this still equates to an error
roughly once in every three or four such subtitles.
(2) Throughput: where picture events and sound are
tightly in synchrony (ignoring lip-sync: focusing
on subject-matter) a throughput delay of more
than about 5–6 seconds can be problematic for the
viewer.
(3) Style control: if multiple speakers are being subti-
tled, it is desirable to be able to control the colour
of individual subtitles in order to reflect speaker
identity.
(4) Position control: if the centre of visual interest can
vary its position on the screen, then it is desirable
for subtitles to be moved to avoid obscuring this
area.
(5) Ease of use: to sustain an economic service, it is
important to be able to select and train new staff to
be subtitle Speakers relatively straightforwardly.
Fast keyboard operators can take around 1 year
(Velotype) or 2–3 years (Stenograph) to train and
become productive; speech subtitlers would ideally
be productive in a matter of 2–3 months.
(6) Flexibility: to be able to respond to changing subject
matter, it should be possible to add new specialist
topics and new vocabulary with the minimum
of overhead and complexity.
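The arithmetic behind criterion (1) can be checked with a short sketch. It assumes that recognition errors are spread evenly over words, which is a simplification made only for this illustration; the function name and its default of 14 words per subtitle are illustrative rather than part of the SpeakTitle system.

```python
def subtitles_per_error(accuracy, words_per_subtitle=14):
    """Average number of subtitles shown per recognition error.

    Assumes errors are uniformly distributed over words (an
    illustrative simplification, not a claim from the paper).
    """
    errors_per_subtitle = words_per_subtitle * (1.0 - accuracy)
    return 1.0 / errors_per_subtitle


for acc in (0.97, 0.98):
    print(f"{acc:.0%} accuracy: one error every "
          f"{subtitles_per_error(acc):.1f} subtitles")
```

Under this assumption, 97% and 98% accuracy give about one error in every 2.4 and 3.6 two-line subtitles respectively, which is broadly in line with the estimate in criterion (1).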

2. The SpeakTitle System
A range of speech recognition systems were investi-
gated, with reference to the results of the 1997 National
Institute of Standards and Technology (NIST) evalua-
tions (Pallett et al., 1997). It was found that recognition
results were most favourable for the IBM ViaVoice Executive
system with recognition rates of 95–98% consistently
recorded in trials based on trained Speakers
reading text at 150 wpm, where there were no out-of-
vocabulary words (reference: ViaVoice®).
The original SpeakTitle system was designed for use
by two operators, the Speaker and the Corrector. The
Speaker listened to the live programme and spoke the
subtitle text while the Corrector corrected the output
before it went on air. It was subsequently found that an
experienced Speaker could achieve recognition rates
without correction that were acceptable for live broad-
casts, and the systems currently in use do not utilize a
Corrector.
2.1. System Overview
A development system was built which incorporated
a video recorder so that television programmes could
be repeated for experimental purposes. An overview of
the development system is given in Fig. 1. It is designed
to be operated by a Speaker and a Corrector.
Figure 1. The design of the SpeakTitle development system.
A television programme is played on the video
recorder, simulating a live feed from the television cameras
used in the operational system. The programme is
transmitted to the Speaker via his workstation and it is
also output on a TV monitor accessible by the Correc-
tor. The Speaker listens to the television programme on
a headset, and can watch the programme in a video image
window on the Speaker workstation. The Speaker
repeats what has been said, but may need to apply a
degree of editing and précis in cases where the television
programme includes rapid streams of speech
which cannot be converted to subtitles in an acceptable
timeframe.
In order to make the subtitles acceptable to the view-
ers, they must contain punctuation, and the Speaker
may either speak punctuation or utilize a touch
screen monitor (the Speaker Interface, described in
Section 2.2) to generate the punctuation words which
are incorporated into the audio stream prior to it be-
ing passed to the recognition engine. The output from
the Speaker thus comprises a single stream incorporat-
ing the spoken subtitles augmented with punctuation
commands. This output can be split, with one stream
passed directly to the ViaVoice recognition engine in
the Corrector’s workstation, and one to the Corrector’s
headset. This allows the possibility of inserting a delay
in the stream to the Corrector which can be adjusted
so that the Corrector hears the output from the Speaker
simultaneously with it showing on the computer moni-
tor. The delay corresponds to the amount of time taken
for the voice recognition engine to process the speech.
The Corrector watches the output on the computer
monitor and corrects any errors before the subtitles are
passed on to an existing subtitling system such as WinCAPS
(reference: WinCAPS, 2003). This uses a teletext
inserter card to merge the subtitle data with the television
video signal for decoding on an ordinary teletext
TV set.
The Speaker Interface, described in detail in 2.2,
gives the Speaker various options such as the use of
buttons for punctuation and macros rather than speak-
ing commands as is necessary with the basic speech
recognition system. The Corrector Interface presents
the text in a scrolling window from which it can be
edited in a short time frame before going out ‘on air’.
This interface incorporates software to amend the output
according to the presenter’s ‘house-style’, for example
the use of upper or lower case characters for words
such as “North-East” and the addition or removal of
hyphens. It is also possible to filter out any offensive
words that might otherwise escape the notice of the
Corrector.
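A house-style pass of this kind can be pictured as a table of text substitutions followed by a word filter. The sketch below is only an illustration of that idea: the pattern table, the blocked-word list and the function name are hypothetical and do not describe the actual SpeakTitle implementation.

```python
import re

# Hypothetical house-style table: regex pattern -> preferred written form.
HOUSE_STYLE = {r"\bnorth[- ]east\b": "North-East"}
# Placeholder list of words to suppress before broadcast.
BLOCKED_WORDS = {"damn"}


def apply_house_style(text, style=HOUSE_STYLE, blocked=BLOCKED_WORDS):
    """Normalise case and hyphenation, then drop blocked words."""
    for pattern, preferred in style.items():
        text = re.sub(pattern, preferred, text, flags=re.IGNORECASE)
    # Compare words without trailing punctuation and ignoring case.
    kept = [w for w in text.split()
            if w.strip(".,!?").lower() not in blocked]
    return " ".join(kept)


print(apply_house_style("gales in the north east today"))
```

Because the substitutions are applied to recognized text rather than to audio, a pass like this can still run when no Corrector is present.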
The Corrector Interface incorporates three scrolling
modes: “elastic,” where text that is being edited is
not sent out until the edit is completed; “semi-elastic,”
which allows text to be delayed for only a limited
amount of time; and “bulldozer,” where text is sent
out continuously, edited or not. A variation of this last
mode, “pass-through,” sends text out directly without
any editing, although house-style changes are still
applied. This has been employed successfully in live
subtitling of snooker with reported error rates of only
2–3%. It is recognized, however, that for faster-paced
programmes, and ones where a high degree of accuracy
is required, such as parliamentary interviews, a
Corrector might be required.
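The difference between the scrolling modes comes down to when a piece of text that the Corrector is editing may be released for transmission. A minimal sketch of that decision follows; the names and the three-second hold limit are assumptions made for illustration, since the paper does not state the limit used in semi-elastic mode.

```python
from enum import Enum


class Mode(Enum):
    ELASTIC = "elastic"
    SEMI_ELASTIC = "semi-elastic"
    BULLDOZER = "bulldozer"


def may_release(mode, being_edited, held_seconds, max_hold_seconds=3.0):
    """Decide whether a piece of text may be sent out now.

    max_hold_seconds is a hypothetical limit for semi-elastic mode.
    """
    if not being_edited:
        return True                      # nothing is holding the text back
    if mode is Mode.ELASTIC:
        return False                     # wait until the edit is completed
    if mode is Mode.SEMI_ELASTIC:
        return held_seconds >= max_hold_seconds
    return True                          # bulldozer: send regardless


print(may_release(Mode.SEMI_ELASTIC, being_edited=True, held_seconds=1.0))
```

In this picture, pass-through is simply the case in which no text is ever being edited, so every piece of text is released as soon as it arrives.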
To deal with the situation in which one Speaker
is re-speaking the words of more than one television
speaker, such as in an interview situation, the possi-
bility of colour coding the text output needed to be
explored as a way to identify the different speakers to
the viewer. As only one human operator is responsi-
ble for both listening to and then repeating material to
be transcribed into the speech recognizer the input to
the speech recognizer does not convey the information
required to colour-code the subtitles automatically. Au-
tomatic colour-coding using the original speech would
require two recognition systems (original speech and
Speaker) to operate independently. They would need
different and varying operational speeds, different inputs
in terms of speakers’ voices, and, in some instances,
in terms of the text content. As a result, it would
not be possible to synchronise the output of the speech
recognizer with that of the colour coding (speaker dis-
crimination) system.
Two techniques have been developed to deal with
this. The first method uses special speech “macros”
which produce commands that can be interpreted by
the subtitling system. The second uses buttons on the
Speaker interface.
A further development which has been designed
to improve the accuracy of the system is the use of
‘Topics’ for specific domains (described in 2.3). Here,
accuracy has been improved by integrating specialised
language models into the system. As reported above,
acceptable recognition results for the IBM ViaVoice
system were achieved with trained Speakers where
there were no out-of-vocabulary words. The SpeakTitle
system aimed to address the treatment of out-of-
vocabulary words using specialist language models or
“Topics”.
In the speech recognizer, the acoustic processor pro-
duces a set of ranked candidate words from the acoustic
signal. The language model then provides information
on the probability of a given word or phrase occurring
in the context, and these two sources of information are
combined to give the output words. In specific domains,
such as a particular profession, specialized vocabular-
ies are used. In domains such as a sports commentary,
the vocabulary may be largely unchanged from general
speech, but certain word patterns are likely to occur,
for example “on the black” is a typical phrase heard in
snooker commentary but seldom anywhere else.
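The way the two sources of information are combined can be illustrated with a toy rescoring step. This is a deliberately simplified sketch and not a description of ViaVoice internals: all log-probability values are invented for the example, and the function and variable names are hypothetical.

```python
def best_hypothesis(candidates, lm_logprob, lm_weight=1.0):
    """Pick the word sequence with the best combined score.

    candidates: list of (words, acoustic_logprob) pairs.
    lm_logprob: function mapping a word tuple to a log probability.
    """
    return max(candidates,
               key=lambda c: c[1] + lm_weight * lm_logprob(c[0]))[0]


def table_lm(table, floor=-10.0):
    """Toy language model: look up a phrase, with a floor for unseen ones."""
    return lambda words: table.get(words, floor)


# Invented scores: the acoustics slightly prefer "back" over "black".
candidates = [(("on", "the", "back"), -4.0),
              (("on", "the", "black"), -4.1)]

general_lm = table_lm({("on", "the", "back"): -2.0,
                       ("on", "the", "black"): -6.0})
snooker_lm = table_lm({("on", "the", "back"): -3.0,
                       ("on", "the", "black"): -1.5})

print(best_hypothesis(candidates, general_lm))
print(best_hypothesis(candidates, snooker_lm))
```

With the general model the acoustically similar “on the back” wins, whereas the snooker-specific model tips the decision to “on the black”, which is the effect a domain language model is meant to have.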
2.2. The Speaker Interface
A prototype Speaker Interface has been developed
which comprises several components (see Fig. 2).
Figure 2. The Speaker Interface.
The Video Image window shows the live television
broadcast (or, when used in test or development mode,
a simulation from a video recording). The subtitle position
selectors allow the Speaker to position the broadcast
subtitle at the top or bottom of the picture so
that it avoids any captions that are already present on-
screen. The Open Microphone Indicator informs the
Speaker when the system is ready to accept speech.
The Start/Stop button activates and deactivates the sys-
tem. The Delay Duration Slider allows the Speaker to
pre-set a delay in the audio signal that is sent to the
Corrector. This is a key feature, since it permits bet-
ter synchronisation between what the Corrector hears
(i.e. a delayed version of the dictation), and the corresponding
appearance of the recognised text output from
ViaVoice which they are to review and correct. With the
development system this delay was typically tuned to
between one and two seconds. The Command Buttons
allow the Speaker to insert punctuation or macros into
the audio signal that is sent to ViaVoice. These buttons
can be customised by the user and typically include
common punctuation and macros to change the colour
of subtitles.
The Delay Mechanism and Command Inserter were
originally implemented as two separate pieces of
hardware. The delay was provided by a broadcast
profanity delay, an expensive piece of equipment designed
to be used primarily to avoid swearing going
out on the air in live situations. The SpeakTitle system
used it in a mode in which it provided a fixed dura-
tion delay. The commands were generated by a laptop
computer running a simple program that played back
a pre-recorded audio file from disk whenever a pre-
determined key was pressed.
Both the Delay Mechanism and the Command In-
serter functions are now implemented in software on a
single desktop computer. Figure 3 shows the multiple
buffer mechanism that is employed. This consists of a
capture buffer that records a mono audio signal, a Cor-
rector playback buffer and a Speaker playback buffer.
The Corrector playback buffer has its output panned to
the left channel of the computer’s stereo audio output
and the Speaker playback buffer has its output panned
to the right channel. This allows the audio outputs of
these buffers to be routed to different destinations. An
array of command sample buffers is created to hold the
pre-recorded punctuation and macro audio command
files.
When the system is active, the Speaker’s speech is
recorded in the capture buffer. The capture buffer is
divided into segments. The size of the segment can be
varied, but the minimum size is limited by the speed
of the host computer. Tests with a 733 MHz Pentium
III show that four increments per second is the most
that can be achieved without loss of continuity. As
soon as a segment has been recorded, it is immedi-
ately copied into the Corrector playback buffer and the
Speaker playback buffer.
If the Speaker playback buffer is not already play-
ing, then playback is started. Playback of the Corrector
playback buffer only commences when the buffer is
full, hence the length of this buffer determines the du-
ration of delay to the signal that the Corrector hears. All
the buffers wrap around. Once they are full, the inser-
tion of data is started again at the beginning, and when
the record and playback “heads” reach the end of the
buffer, they are immediately returned to the start. When
the Speaker presses a command button, the system is
notified. The next two copy operations for the Speaker
playback buffer are then sourced from the relevant sample
buffer, not the capture buffer. In this way, the command
output is passed to ViaVoice, but the Corrector
does not hear it.
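The buffering scheme described above can be sketched in a few lines. The sketch treats each audio segment as an opaque value and uses a queue in place of the wrap-around buffers, so it models the routing and timing only, not real audio capture; the class and method names are hypothetical.

```python
from collections import deque


class DelayAndCommandInserter:
    """Toy model of the two playback streams (names are illustrative)."""

    def __init__(self, delay_segments, commands):
        self.delay_segments = delay_segments  # length of Corrector buffer
        self.commands = commands              # name -> list of segments
        self.corrector_buffer = deque()
        self.pending_command = deque()

    def press(self, name):
        """Queue the pre-recorded segments for a command button."""
        self.pending_command.extend(self.commands[name])

    def capture(self, segment):
        """Handle one captured segment.

        Returns (to_recognizer, to_corrector); to_corrector is None
        until the Corrector buffer has filled.
        """
        # Recognizer stream: command audio replaces live audio.
        if self.pending_command:
            to_recognizer = self.pending_command.popleft()
        else:
            to_recognizer = segment
        # Corrector stream: always the captured speech, delayed.
        self.corrector_buffer.append(segment)
        if len(self.corrector_buffer) > self.delay_segments:
            to_corrector = self.corrector_buffer.popleft()
        else:
            to_corrector = None
        return to_recognizer, to_corrector


inserter = DelayAndCommandInserter(2, {"comma": ["cmd1", "cmd2"]})
out = [inserter.capture(s) for s in ["s0", "s1"]]
inserter.press("comma")
out += [inserter.capture(s) for s in ["s2", "s3", "s4"]]
print(out)
```

In this run the recognizer stream receives the two command segments in place of the live audio captured at that moment, while the Corrector stream carries only the Speaker's own speech, two segments late.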
Figure 3. The Delay Mechanism and Command Inserter.
2.3. “Topics” – Specialized Language Models
The use of specialized language models can improve
the accuracy of the speech recognizer. The main
language model is an integral part of any speech
recognizer. It takes the candidate words from the acoustic
processor and ranks possible output in order of probability.
Thus “a tax on petrol” would be more likely
than “attacks on petrol.” In order to do this two components
have to be processed. First there is the set of
single words that comprise the vocabulary of the lan-
guage model (LM). The second component of the LM
is the set of trigrams, three adjacent words, that define
contexts in which words are likely to occur. These two
components constitute the training data for the LM.
Now, the main language model can be augmented by
a specialized LM, also called a Topic, developed for
a particular subject field, and then integrated with the
main LM. ViaVoice has the facility to develop such
Topics, and they have been successfully used in the subtitling
of live TV sports programmes: snooker, golf,
tennis, football (soccer), athletics. The Topic can be
switched in or out as required.
In customizing a language model by integrating a
Topic there are two steps. First, single words have to
be added to the vocabulary. Topics developed for particular
professional use, such as legal or medical domains,
would have appropriate technical terms. In sporting do-
mains the names of players are specific to each sport
and need to be added to the vocabulary. This may be
done by the user who can update the main vocabulary
right up to the last minute. Some names can be accepted
through spoken input, but it may be necessary to use
a phonetic transcription facility, described below, par-
ticularly for foreign names. However, if the addition is
made to the main vocabulary it will then be a perma-
nent word there. If names are added to the Topic, which
is prepared in advance, they can be used when relevant
only, as the Topic can be switched in and out.
The second component of the Topic is the domain
specific set of trigrams. In sporting domains, the spe-
cialized vocabulary is often quite limited, apart from
names of players. Few words occur that are not in
the base vocabulary of the speech recognition sys-
tem. However, there are characteristic combinations
of words that seldom occur in general text or speech.
For instance in football commentary trigrams like “a
free kick,” “hit the post,” “yellow card for” occur frequently,
yet they did not arise in 2.5 million words
of TV chat show speech. In tennis we need to avoid
solecisms such as “number to court”. The words themselves
in these examples are not peculiar to the football
or tennis topic, but their combinations are. In speech
recognition the ’Äúsparse data problem’Äù is a key issue
(Gibbon et al., 1997; Ney et al., 1997). This is the observed
phenomenon that a small number of words occur
frequently, but most words occur rarely (a Zipfian
distribution). This phenomenon is more pronounced for
bigrams and trigrams. For instance in 39 million words
from the Wall Street Journal, 77% of the trigrams have
only occurred once. If a new passage in this limited do-
main is considered, most of the trigrams will not have
occurred before (Gibbon et al., 1997, p. 258). By mod-
elling the domain characteristics, we begin to address
this problem.
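The sparse data effect is easy to observe on any text by counting how many distinct trigrams occur only once. The following sketch shows the computation on a toy word list; the helper names are illustrative and the input is far too small to reproduce the Wall Street Journal figure quoted above.

```python
from collections import Counter


def ngrams(words, n):
    """Return the list of n-grams (as tuples) in a word sequence."""
    return list(zip(*(words[i:] for i in range(n))))


def singleton_fraction(words, n=3):
    """Fraction of distinct n-grams that occur exactly once."""
    counts = Counter(ngrams(words, n))
    if not counts:
        return 0.0
    return sum(1 for c in counts.values() if c == 1) / len(counts)


sample = "a free kick and a free kick for".split()
print(singleton_fraction(sample))
```

Even in this eight-word sample, four of the five distinct trigrams occur only once.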

In developing Topics for sports, the relevant trigrams
for training were obtained from data supplied by the
broadcasters. Transcribed commentaries that had pre-
viously been broadcast were taken and processed into
sets of trigrams. As much data as possible was col-
lected, with a target of a million words, though we
usually had to make do with less. This is an arbitrary
figure, based on empirical investigations, but in gen-
eral larger amounts of training data are desirable so
that more trigrams can be captured. The training data
is then cleaned, if necessary. Cleaning can include excising
extraneous comments and removing glaring errors
that are not wanted as models. The IBM ViaVoice
Topic Factory tool is used to process the training data,
and produce the Topic which can be integrated with the
main language model.
2.3.1. Example from Commentary on Snooker. To
illustrate these issues, we report on an example from
live TV commentary on snooker. The data used is from
stenographers’ broadcast output, produced in real time,
and therefore occasionally noisy. The figures given are
rounded to avoid spurious precision. A small corpus of
snooker commentary, 59 K words, has been compared
to a base corpus of 2.5 million words from TV chat
shows. A control experiment with another 59 K corpus
from new chat shows is also used. We see how many
single words, word pairs and word trigrams are unique
to the snooker corpus and compare these figures to
those from the control corpus (Table 1).
If we exclude names, numbers and errors, there are
nineteen words in the snooker corpus that did not occur
in the base corpus, e.g. terms like “missable” and
“pottable.” Compared to the control corpus, there are
few words that are in the snooker corpus but not in the
base corpus. However, there are a comparable number
of new trigrams. Phrases like “behind the red” or “the
opening pot” occur frequently. There are characteristic
constructions, typical of the domain, that can be exploited.
See Table 2, but note “words” include names,
numbers and errors.
Table 1. Statistics of the corpora (including as words names, numbers and errors).
                     Base corpus   Snooker   Control
Number of words      2,592 K       59 K      59 K
Distinct words       53 K          4 K       7 K
Distinct bigrams     725 K         27 K      34 K
Distinct trigrams    1,737 K       48 K      51 K
Table 2. Numbers (rounded) of all words, bigrams and trigrams not in base corpus (including as words names, numbers and errors).
                                              Snooker   Control
Words, frequency ≥ 2, not in base corpus      90        270
Bigrams, frequency ≥ 2, not in base corpus    1200      1300
Trigrams, frequency ≥ 2, not in base corpus   2000      2300
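A comparison of this kind amounts to counting the n-grams of the domain corpus and keeping those that reach a frequency threshold and are absent from the base corpus. The sketch below shows one way to do this; it is not the tool used in the study, the two toy corpora are invented, and the threshold of 2 is the frequency cut-off assumed here.

```python
from collections import Counter


def ngram_counts(words, n):
    """Count the n-grams (as tuples) in a word sequence."""
    return Counter(zip(*(words[i:] for i in range(n))))


def novel_ngrams(domain_words, base_words, n, min_freq=2):
    """n-grams seen at least min_freq times in the domain corpus
    and never in the base corpus."""
    base = set(ngram_counts(base_words, n))
    domain = ngram_counts(domain_words, n)
    return {g: c for g, c in domain.items()
            if c >= min_freq and g not in base}


# Invented miniature corpora, for illustration only.
base = "he sat behind the desk and looked at the red car".split()
snooker = "he is behind the red and still behind the red".split()

print(novel_ngrams(snooker, base, n=1))
print(novel_ngrams(snooker, base, n=3))
```

In the toy data every repeated word of the snooker text already occurs in the base text, yet the trigram “behind the red” is new, which mirrors the pattern shown in the tables.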
The training data are presented to Topic Factory, and
any words that need phonetic representations are iden-
tified. The developer may be able to get a correct pho-
netic representation accepted by speaking the word.
Otherwise, the phonetic representation has to be typed
in, using a mapping in which phonetic symbols are
mapped onto ASCII letter combinations. After this is
done, Topic Factory will process the prepared training
data to produce the customized LM.
In the past, language models have typically been
evaluated by assessing the perplexity of the model
on test data. However, recent work indicates that di-
rect word error rates may be a more useful metric
(Clarkson and Robinson, 1998), and that is the measure
utilized here, using the CRER (Composite Record of
Errors in Recognition) analyzer described below. Preliminary
experiments indicate improvements of 1–3%
when adding a sporting Topic.
Other work in this field, such as the development
of story topics for stories in similar domains, has used
single word similarities as a basis for clustering texts
(Seymour and Rosenfeld, 1997). The snooker example
illustrates how the use of word combinations such as
trigrams is more powerful.
3. The System in Use
The CRER assessment tool was developed to enable
more consistent measurement of accuracy, clearer rep-
resentation of results and the ability to make a more
detailed investigation. This enabled an analysis of the
Speakers’ performances under different conditions.
This analysis and the experience of Speakers work-
ing in a live environment have in turn contributed to
improvements in the SpeakTitle system.
3.1. CRER (Composite Record of Errors in
Recognition) Tool
The CRER software is an analytical tool which
compares scripts of the actual spoken input with the
recognized output. It provides measurements of overall
accuracy (defined as number of words minus substitutions,
deletions and insertions) and correctness (defined
as number of words minus substitutions and deletions),
as well as a range of detailed error statistics for substitutions,
deletions and insertions of words. The difference
between accuracy and correctness can be seen in the
following illustration. If the word “away” is
recognized as “the way”, this gives one substitution and
one insertion error. This would count as two errors on the
accuracy metric but only one on the correctness metric. Different
types of errors are represented with different colour
codes for ease of analysis, and all results are shown as
percentages. This detailed and consistent analysis of
live subtitling results has provided a baseline measurement
against which to test improvements made to
the system with such developments as Topics and the
Speaker interface.
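The two metrics can be sketched in code as follows. This is a minimal illustration, assuming a standard minimum-edit-distance word alignment and a non-empty reference script; the actual alignment procedure used inside CRER is not described here, and the function names are hypothetical.

```python
def count_errors(reference, hypothesis):
    """Align two word strings by minimum edit distance and return
    (substitutions, deletions, insertions)."""
    ref, hyp = reference.split(), hypothesis.split()
    n, m = len(ref), len(hyp)
    # cost[i][j] = fewest edits turning ref[:i] into hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Trace back through the table, classifying each edit.
    subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])):
            subs += ref[i - 1] != hyp[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return subs, dels, ins

def accuracy_and_correctness(reference, hypothesis):
    """Return (accuracy, correctness) as percentages of reference words."""
    n = len(reference.split())
    subs, dels, ins = count_errors(reference, hypothesis)
    accuracy = 100.0 * (n - subs - dels - ins) / n
    correctness = 100.0 * (n - subs - dels) / n
    return accuracy, correctness

errors = count_errors("he moved away", "he moved the way")
acc, corr = accuracy_and_correctness("he moved away", "he moved the way")
```

For the “away” / “the way” illustration embedded in a three-word script, the alignment yields one substitution and one insertion, so accuracy is one third and correctness is two thirds of the reference words.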
3.2. Accuracy
Preliminary tests were carried out to assess the initial
relative accuracy of the basic ViaVoice speech recognition
system (i.e. without the use of Topics or the SpeakTitle
Speaker Interface developments) with a range of
untrained speakers and a range of input modes (the
speaker either reading text or hearing speech). One
objective of these tests was to assess whether certain
speakers were better suited to the front end of the live
subtitling environment, where abilities such as hearing/reading,
comprehension and speaking would be utilized.
Because the Speaker had to draw on this variety of
abilities in order to complete the tests, it was felt that
certain characteristics might be highlighted as producing
greater accuracy. Another objective of these tests
was to assess the basic accuracy level of a minimally
trained speech recognition system with inexperienced
Speakers, in order to identify the range of improvement
that needed to be achieved with the SpeakTitle system.
These tests were carried out at the start of the project,
in 1998, using an earlier version of ViaVoice than was
subsequently employed.
Fifteen untrained Speakers were asked to speak
strings of words aloud, where the input source was
either text or audio. The rate of speaking
for the audio input tests was gradually increased, thus
permitting less time between the speaker hearing and
repeating the words. As might be expected, read text
input provided the greatest accuracy overall (92%). With
audio input, the highest accuracy was found when the
speakers had more time between hearing and saying the
words (average of 83%). However, the findings highlighted not so
much speaker variation or speaker suitability for online
subtitling, as was initially anticipated, but the importance
of training. The recognition accuracy needs to be
around 97–98% for a live-subtitling environment, and
the earlier quoted figures of 95–98% for
ViaVoice were reliant on both the level of ViaVoice
training and the avoidance of out-of-vocabulary words.
The development of the Speaker Interface together
with the use of Topics for the SpeakTitle system (described
above in 2.2 and 2.3) has not only improved
accuracy, and therefore the viability of automated subtitling
of live broadcasts, but has also provided the users
of the system with a more manageable and easy-to-use
interface.
The system was adopted for use by a television com-
pany before the end of the project and a number of
dedicated speakers were engaged to make it opera-
tional for particular types of live programme, particu-
larly sport. Evidence that the system is now acceptable
comes from the commercial decisions made to use it to
replace the traditional methods of subtitling. It was not
possible to divert commercial operations away from
their prime functions to conduct objective tests except
on a very limited scale, but early reports on the accuracy
of the system in real-time live use with dedicated
and well-trained operators have already given figures
of around 98%.
3.3. Experience of Speakers
Certain findings from the user perspective
and from the experience of running the system in
a live subtitling environment have contributed to
further development of the SpeakTitle system to improve
ease of use and accuracy. For example, speaker
training at the speed of expected normal delivery and
without undue hesitation was found to give improved
recognition rates. Setting up the microphone before a
live session and configuring SpeakTitle for throughput
(i.e. setting the transmission rate and the maximum number
of lines of text visible on-screen for editing) before output to
the transmission buffer have both improved accuracy.
It has been found that no special microphone or
sound card equipment is required to reach satisfactory
recognition levels. Most modern computers have
an adequate sound card that is compatible with the
SoundBlaster sound card standard, and the Andrea
NC-61 microphone shipped with ViaVoice gives
recognition as good as more expensive microphones in
normal situations. Whilst ViaVoice will take background
noise into account when performing speech
recognition, a consistent background level, or ideally
a silent acoustic environment, gives better results. A
specialised Proximity Microphone is more suitable in
environments where there is variable local noise, such
as other speakers.
Throughput has been further improved by fine-tuning
the speech system for the best throughput balance
in a number of ways. For example, setting an appropriate
words per minute (WPM) rate for delivery of subtitle
text to WinCAPS and setting the maximum number
of lines in the Edit box so that the text does not build
up have both been effective. Identifying at the
outset which subtitle presentation style best meets
the needs of the transmission can also benefit
throughput. The subtitles may be displayed in block
mode, where the whole subtitle is displayed at once, or
scrolling mode, where the subtitle is displayed word by
word. On a practical level, and from the Speaker’s perspective,
having clear rules on error correction greatly
increases throughput. For example, knowing to correct
only text that seriously detracts from the meaning intended
in the spoken text, rather than all errors, can
speed up the editing process.
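The relationship between a WPM setting and the two presentation styles can be sketched as follows. This is a hypothetical illustration, not the SpeakTitle or WinCAPS interface: the function name, the event format, and the assumption that scrolling mode releases one word every 60/WPM seconds are all introduced here for the example.

```python
def display_events(subtitle, wpm, mode="scrolling"):
    """Return (time_in_seconds, text) display events for one subtitle.

    block mode:     the whole subtitle appears at once, at time 0.
    scrolling mode: words appear one by one, paced at `wpm` words per minute.
    """
    words = subtitle.split()
    if mode == "block":
        return [(0.0, " ".join(words))]
    if mode != "scrolling":
        raise ValueError("mode must be 'block' or 'scrolling'")
    interval = 60.0 / wpm  # seconds allotted to each word
    return [(index * interval, word) for index, word in enumerate(words)]

scrolled = display_events("a fine pot", wpm=120)
blocked = display_events("a fine pot", wpm=120, mode="block")
```

At 120 WPM each word is allotted half a second, so a slower rate than the Speaker's delivery would cause text to accumulate, which is the build-up the settings described above are meant to prevent.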
Other SpeakTitle features enable the fine-tuning of
the system for specific treatment of homophones, short-forms
and the text processing required for specific in-house
styles pre-defined by the user. A greater or lesser
degree of editing can also be controlled by the system.
From the staff selection and training perspective, a
“good” Speaker has been defined as one who gives a
consistent delivery with fully pronounced words. Slight
accents seem not to lower recognition accuracy, and
tests completed with a range of Speakers from different
backgrounds (see 3.2 above) found little variation in
accuracy due to age, sex, ethnic origin, education level,
vocal tone, volume, experience of dictation, speed or
even having a first language other than English.
4. Conclusions
The main achievement of the project has been the development
of an effective working system for live
subtitling. The SpeakTitle project has developed from
an initially basic system, which utilized a standard
speech recognition package together with standard
speaker and correction modules, to one which can be
finely tuned to the user requirements in terms of ‘Topics’
and speaker interface requirements. This has provided
a usable working system which is now used live on air
for three major UK broadcasters, giving sufficiently accurate
results and providing hearing-impaired viewers
with access to more live television.
The development of CRER as a tool for assessing
the performance of continuous speech recognizers has
been another valuable outcome from the project, since it
provides a consistent measure for such assessment.
Work is now under way to utilize
this tool more extensively to assess the on-air results of
the growing volume of ‘live subtitling’ being undertaken
by the two major users of SpeakTitle. The results
from this exercise will direct future work on further improving
the accuracy levels towards human perception
levels.
A number of future applications of and spin-offs
from the ‘live subtitling’ technology developed for the
SpeakTitle project are being explored. The presentation
on screen of the text of lectures, meetings and telephone
conversations is another area for further evaluation,
and tests are already under way to assess the
suitability for lecture situations. The obvious advantages
to the hearing-impaired of converting from audio
input to text output might also be applicable to viewers
who do not have English as their first language and who
may find text easier to follow than spoken words. The
lessons learnt during the process of achieving real-time
subtitling for English and the processes developed as
part of this project could be applied in the future to
languages other than English; work towards this end
has been done in Japan (NHK, 2002). One further
application of the SpeakTitle technology could be in
the field of translation, where a bilingual Speaker listening
to output spoken in one language could provide
text output in another. These areas are being further
explored based upon the results of the CRER analysis
and experiences of running subtitling in a live setting
for major British broadcasters since summer 2002.
Acknowledgment
This research was carried out as part of LINK project
number GR/M15958/01 (LINK) under the Broadcast
Technology Initiative, partly funded by the DTI and
EPSRC in the United Kingdom. The main partners
were the University of Hertfordshire and Synapsys
Ltd. (now SysMedia), a company which specialises in
broadcast subtitling and digital information products.

All trademarks are the property of their respective
owners.
References
Bateman, A., Hewitt, J., and Lambourne, A. (2001). Subtitles
from Simultaneous Transdiction: Multi-modal Interfaces for Gen-
erating and Correcting Real-time Subtitles, HCII2001, New
Orleans.
Clarkson, P. and Robinson, T. (1998). The applicability of adaptive
language modelling for the broadcast news task. Proceedings of
ICSLP. Sydney, Australia, pp. 1699–1702.
Damper, R.I., Lambourne, A.D., and Guy, D.P. (1985). Speech input
as an adjunct to keyboard entry in television subtitling. In
B. Shackel (Ed.), Proceedings Human-Computer Interaction -
INTERACT’84, pp. 203–208.
Gibbon, D., Moore, R., and Winski, R. (Eds.) (1997). Handbook of
Standards and Resources for Spoken Language Systems. Mouton
de Gruyter., Chapter 7.
Hewitt, J., Bateman, A., Lambourne, A., Ariyaeeinia, A., and
Sivakumaran, P. (2000). Real-time speech generated subtitles:
Problems and solutions. 6th International Conference on Spoken
Language Processing ICSLP 2000. Vol. III.
ITC guidance on standards for subtitling (amended February 1999):
http://www.itc.org.uk/itc publications/codes guidance/standards
for subtitling/index.asp
LINK. (1998). The Use Of Speech Recognition In Live TV Subtitling,
LINK Project No. GR/M15958/01, 1/10/1998–30/9/2001.
Overview of LINK Project: http://homepages.feis.herts.ac.uk/~nehaniv/idmf/abstracts/hewitt.doc
National Captioning Institute. http://www.ncicap.org/ acapintro.asp
Ney, H., Martin, S., and Wessel, F. (1997). Statistical language
modelling using leaving one out. In S. Young and G. Bloothoft
(Eds.), Corpus Based Methods in Language and Speech Process-
ing. Kluwer Academic.
NHK. (2002). http://www.nhk.or.jp/strl/open2002/en/tenji/id03/03.
html
Pallet, D.S., et al. (1997). Broadcast news benchmark test re-
sults: English and Non-English. Proc. DARPA Speech Recognition
Workshop 1997.
Seymour, K. and Rosenfeld, R. (1997). Using story topics for lan-
guage model adaptation. Proceedings of Eurospeech97.
Sivakumaran, P., Fortuna, J., and Ariyaeeinia, A.M. (2001). On the
use of the bayesian information criterion in multiple speaker de-
tection. Proceedings of Eurospeech2001.
Sivakumaran, P., Ariyaeeinia, A., and Fortuna, J. (2002). An effective
unsupervised scheme for multiple speaker detection. ICSLP2002.
Denver, Colorado, Topic 16.
Stenograph: http://www.stenograph.com
UK legislation:
Broadcasting Act 1990 (c. 42) Section 35, HM Stationery Office
UK.
Broadcasting Act 1996 (c. 42) Section 20(3)(a), HM Stationery
Office UK.
Statutory Instrument 2000 no 2378 : Broadcast (subtitling) order
2001, HM Stationery Office UK.
UK standards:
Unified Standard April 1974, BBC Engineering Sheet 4008(5),
Oct. 1975.
Joint IBA/BBC/BREMA Publication: Broadcast Teletext Specifi-
cation, September 1976.
US legislation:
Television Decoder Circuitry Act of 1990, US Congress.
Telecommunications Act of 1996, US Congress.
Federal Communications Commission Rule 79: Closed Captioning
of Video Programming, updated 2001.
US standards and history:
FCC Report and Order FCC 91-119 1991.
FCC Memorandum, Opinion and Order FCC 92-157 1992.
EIA/CEA-608-B : Recommended Practice for Line 21 Data
Service, 31 Oct 2000, see http://www.ce.org/standards/
standard details.asp?id=270.
EIA-708-B for Digital Television Closed Captioning, 29 Dec
1999, see http://www.ce.org/standards/standard details.asp?
id=249.
Electronic Industries Association. Engineering Department,
20001 Pennsylvania Avenue, N.W., Washington, D.C. 20006.
http://www.robson.org/capfaq/caption-charset.html).
http://ncam.wgbh.org/resources/icr/line21hist.html
http://main.wgbh.org/wgbh/pages/mag/services/captioning/
Velotype:
http://www.velotype.com/
http://www/velotype.nl/
ViaVoice®: http://www.ibm.com/software/speech
WinCAPS: (2003) SysMedia Ltd. at http://www.sysmedia.com/
subtitling/pdfs/wincaps multimedia.pdf