Speech-Based Real-Time Subtitling Services
passed directly to the ViaVoice recognition engine in
the Corrector’s workstation, and one to the Corrector’s
headset. This allows the possibility of inserting a delay
in the stream to the Corrector, which can be adjusted
so that the Corrector hears the output from the Speaker
simultaneously with it showing on the computer monitor.
The delay corresponds to the amount of time taken
for the voice recognition engine to process the speech.
The Corrector watches the output on the computer
monitor and corrects any errors before the subtitles are
passed on to an existing subtitling system such as WinCAPS
(reference: WinCAPS, 2003). This uses a teletext
inserter card to merge the subtitle data with the television
video signal for decoding on an ordinary teletext
TV set.
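As an illustration only, the adjustable delay on the Corrector’s audio feed can be sketched as a first-in, first-out buffer of audio frames. The names `DelayLine` and `frames_for_delay`, and the 20 ms frame length used in the example, are assumptions made for this sketch; they are not taken from the system described here.

```python
from collections import deque


class DelayLine:
    """Delay a stream of audio frames by a fixed number of frames (hypothetical helper)."""

    def __init__(self, delay_frames):
        if delay_frames < 0:
            raise ValueError("delay_frames must be non-negative")
        self.delay_frames = delay_frames
        self._buffer = deque()

    def push(self, frame):
        """Store the newest frame; return the frame now due for playback.

        Returns None while the buffer is still filling up to the delay length.
        """
        self._buffer.append(frame)
        if len(self._buffer) > self.delay_frames:
            return self._buffer.popleft()
        return None


def frames_for_delay(delay_seconds, frame_seconds):
    """Convert a delay in seconds into a whole number of frames."""
    return round(delay_seconds / frame_seconds)


# Example: a 1.5 s recognition latency with assumed 20 ms audio frames.
line = DelayLine(frames_for_delay(1.5, 0.02))
```

Adjusting the delay then amounts to choosing a different frame count, so that the headset audio lines up with the moment the recognized text appears on the monitor.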
The Speaker Interface, described in detail in 2.2,
gives the Speaker various options such as the use of
buttons for punctuation and macros rather than speaking
commands, as is necessary with the basic speech
recognition system. The Corrector Interface presents
the text in a scrolling window from which it can be
edited in a short time frame before going out ‘on air’.
This interface incorporates software to amend the output
according to the presenter’s ‘house-style,’ for example
the use of upper or lower case characters for words
such as “North-East” and the addition or removal of
hyphens. It is also possible to filter out any offensive
words that might otherwise escape the notice of the
Corrector.
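A house-style pass of the kind described above can be sketched as a small text filter. This is a minimal sketch under stated assumptions: the list of preferred forms, the placeholder blocked word, and the function name `apply_house_style` are all hypothetical and do not reflect the actual rules or implementation of the Corrector Interface.

```python
import re

# Hypothetical house-style entries: the preferred spelling of each term.
PREFERRED_FORMS = ["North-East", "South-West"]

# Hypothetical placeholder standing in for a list of offensive words.
BLOCKED_WORDS = {"blast"}


def _variant_pattern(preferred):
    """Match the term with a hyphen, a space, or nothing between its parts, in any case."""
    parts = preferred.split("-")
    body = r"[-\s]?".join(re.escape(part) for part in parts)
    return re.compile(r"\b" + body + r"\b", re.IGNORECASE)


def apply_house_style(text, preferred_forms=PREFERRED_FORMS,
                      blocked=BLOCKED_WORDS, mask="***"):
    """Normalise case and hyphenation, then mask any blocked words."""
    for form in preferred_forms:
        text = _variant_pattern(form).sub(form, text)

    def mask_word(match):
        word = match.group(0)
        return mask if word.lower() in blocked else word

    return re.sub(r"[A-Za-z']+", mask_word, text)


print(apply_house_style("rain in the north east and NORTHEAST"))
```

Because the rules are plain data, a broadcaster’s preferences could be changed without touching the filtering logic.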
The Corrector Interface incorporates three scrolling
modes: “elastic,” where text that is being edited is
not sent out until the edit is completed; “semi-elastic,”
which allows text to be delayed for only a limited
amount of time; and “bulldozer,” where text is sent
out continuously, edited or not. A variation of this last
mode, “pass-through,” sends text out directly without
any editing, although house-style changes are still
applied. This has been employed successfully in live
subtitling of snooker with reported error rates of only
2–3%. It is recognized, however, that for faster paced
programmes, and ones where a high degree of accuracy
is required, such as parliamentary interviews, a
Corrector might be required.
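The difference between the three scrolling modes can be summarised as a release rule applied to each piece of text waiting to go to air. The sketch below is illustrative only: the enum `ScrollMode`, the function `should_release`, and the 5-second hold limit are assumptions, since the text does not state the limit actually used in semi-elastic mode.

```python
from enum import Enum


class ScrollMode(Enum):
    ELASTIC = "elastic"
    SEMI_ELASTIC = "semi-elastic"
    BULLDOZER = "bulldozer"


def should_release(mode, being_edited, held_seconds, max_hold_seconds=5.0):
    """Decide whether a piece of text may be sent out now.

    max_hold_seconds is an assumed value for the semi-elastic limit.
    """
    if mode is ScrollMode.BULLDOZER:
        # Text goes out continuously, edited or not.
        return True
    if mode is ScrollMode.ELASTIC:
        # Text is held for as long as the Corrector is editing it.
        return not being_edited
    # Semi-elastic: an edit can hold text back, but only up to the limit.
    return (not being_edited) or held_seconds >= max_hold_seconds
```

For example, text that has been under edit for six seconds would still be held in elastic mode but would be released in semi-elastic mode under the assumed 5-second limit.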
To deal with the situation in which one Speaker
is re-speaking the words of more than one television
speaker, such as in an interview, the possibility
of colour coding the text output needed to be
explored as a way to identify the different speakers to
the viewer. As only one human operator is responsible
for both listening to and then repeating material to
be transcribed into the speech recognizer, the input to
the speech recognizer does not convey the information
required to colour-code the subtitles automatically.
Automatic colour-coding using the original speech would
require two recognition systems (original speech and
Speaker) to operate independently. They would need
different and varying operational speeds, different inputs
in terms of speakers’ voices, and, in some instances,
in terms of the text content. As a result, it would
not be possible to synchronise the output of the speech
recognizer with that of the colour coding (speaker
discrimination) system.
Two techniques have been developed to deal with
this. The first method uses special speech “macros”
which produce commands that can be interpreted by
the subtitling system. The second uses buttons on the
Speaker Interface.
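The macro approach can be pictured as follows: the recognizer output contains special tokens which are not displayed but which change the colour of the text that follows. The sketch below is hypothetical throughout. The token spellings, the colour names, and the function `colour_segments` are invented for illustration, and the actual command format interpreted by the subtitling system is not given here.

```python
# Hypothetical macro tokens emitted by the recognizer, mapped to subtitle colours.
COLOUR_MACROS = {
    "[speaker-one]": "white",
    "[speaker-two]": "yellow",
}


def colour_segments(tokens, macros=COLOUR_MACROS, default="white"):
    """Split a recognized token stream into (colour, text) segments.

    Macro tokens are removed from the text and switch the current colour.
    """
    segments = []
    colour = default
    words = []
    for token in tokens:
        if token in macros:
            if words:
                segments.append((colour, " ".join(words)))
                words = []
            colour = macros[token]
        else:
            words.append(token)
    if words:
        segments.append((colour, " ".join(words)))
    return segments


print(colour_segments(["hello", "[speaker-two]", "good", "evening"]))
```

A button on the Speaker Interface could feed the same function by inserting the corresponding token into the stream when pressed.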
A further development which has been designed
to improve the accuracy of the system is the use of
‘Topics’ for specific domains (described in 2.3). Here,
accuracy has been improved by integrating specialised
language models into the system. As reported above,
acceptable recognition results for the IBM ViaVoice
system were achieved with trained Speakers where
there were no out-of-vocabulary words. The SpeakTitle
system aimed to address the treatment of out-of-vocabulary
words using specialist language models or
“Topics”.
In the speech recognizer, the acoustic processor produces
a set of ranked candidate words from the acoustic
signal. The language model then provides information
on the probability of a given word or phrase occurring
in the context, and these two sources of information are
combined to give the output words. In specific domains,
such as a particular profession, specialized vocabularies
are used. In domains such as a sports commentary,
the vocabulary may be largely unchanged from general
speech, but certain word patterns are likely to occur;
for example, “on the black” is a typical phrase heard in
snooker commentary but seldom anywhere else.
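One common way to combine the two sources of information is to add the log of the acoustic score to a weighted log of the language-model probability and choose the highest-scoring candidate. The sketch below assumes that scheme; it is not a description of how ViaVoice combines them. All probabilities are invented to illustrate the “on the black” example, and `pick_word`, `lm_weight` and `floor` are hypothetical names.

```python
import math


def pick_word(candidates, lm_probs, lm_weight=1.0, floor=1e-6):
    """Choose the candidate with the best combined acoustic and language-model score.

    candidates: list of (word, acoustic_probability) pairs.
    lm_probs:   probability of each word in the current context;
                unseen words fall back to a small floor value.
    """
    def score(item):
        word, acoustic_prob = item
        return math.log(acoustic_prob) + lm_weight * math.log(lm_probs.get(word, floor))

    return max(candidates, key=score)[0]


# Invented numbers: acoustically, "back" is slightly preferred after "on the".
candidates = [("back", 0.55), ("black", 0.45)]

# Invented context probabilities for a general model and a snooker Topic.
general_lm = {"back": 0.02, "black": 0.005}
snooker_lm = {"back": 0.005, "black": 0.05}

print(pick_word(candidates, general_lm))  # back
print(pick_word(candidates, snooker_lm))  # black
```

With these invented figures, the general model keeps the acoustically preferred word, while the snooker model overturns it, which is the effect a domain-specific language model is intended to have.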
2.2. The Speaker Interface
A prototype Speaker Interface has been developed
which comprises several components (see Fig. 2).
The Video Image window shows the live television
broadcast (or, when used in test or development mode,
a simulation from a video recording). The subtitle position
selectors allow the Speaker to position the broadcast
subtitle at the top or bottom of the picture so