Contact: Steve Renals, Department of Computer Science, University of Sheffield,
Sheffield S1 4DP, UK (firstname.lastname@example.org)
The project builds on the LVCSR technology developed by some of the THISL partners in the ESPRIT Long Term Research projects WERNICKE and SPRACH, the NLP technology developed by Thomson, and modern text retrieval technology. The principal objectives are:
Specifically, the project is concerned with constructing a demonstration system that accurately recognizes broadcast speech in the broadcast news domain and produces multimedia indexing data from the recognized speech. The project is concentrating on British and American English applications, with work in progress on a French language system. The resulting system may be regarded as a ``news-on-demand'' application in which specific portions of a broadcast may be retrieved in response to a spoken request from the user.
Following discussions with BBC library staff, three application areas have emerged: archives, newsroom systems and programme workgroups. The newsroom application has become the main target area. It has a more manageable throughput than the others (several hours of broadcast news audio per day), and it is also more time-constrained.
The principal expected result of this project is a demonstration that LVCSR can be used as the basis for a broadcast news-on-demand system. Experiments carried out during the first half of the project, primarily in the framework of the ARPA/NIST TREC Spoken Document Retrieval Evaluation programme, indicate that this is feasible even with the current performance of speech recognition systems.
Several innovative intermediate results will be required to achieve this goal, including the transcription of broadcast speech; the development of audio-editing tools, such as hesitation detection and audio-track keyword search; content-based retrieval from audio and video archives; and a robust spoken language interface for search and retrieval of audio and video data.
In addition to the usual keyboard/mouse interface, we are developing a spoken query interface to the THISL system, in which the user may interact verbally with the system, to develop more refined queries.
These components will be brought together in a demonstration system using BBC news output, collected over the duration of the project. The overall architecture of the system is illustrated in figure 1.
The Abbot acoustic model is based on a recurrent network that is used to estimate the local posterior probability of each phone given the acoustic data. This approach, which differs from the likelihood estimation of traditional HMMs, has a number of benefits, including the use of posteriors to help prune the search space and more compact acoustic models.
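The relationship between posteriors and HMM likelihoods can be illustrated with a minimal sketch. It assumes the standard hybrid connectionist/HMM recipe, in which a phone posterior is divided by the phone prior to give a scaled likelihood, and a simple posterior threshold for pruning. The function names, the threshold value and the toy numbers are illustrative assumptions and are not taken from Abbot.

```python
import math

def scaled_log_likelihoods(posteriors, priors, floor=1e-10):
    """Convert one frame of phone posteriors P(q|x) into scaled
    log-likelihoods log(P(q|x) / P(q)), as used in hybrid decoding.
    Values are floored to avoid log(0)."""
    return [
        math.log(max(post, floor)) - math.log(max(prior, floor))
        for post, prior in zip(posteriors, priors)
    ]

def active_phones(posteriors, threshold):
    """Return indices of phones whose posterior reaches the threshold.
    Phones below it could be skipped by the search for this frame
    (an assumed, simplified form of posterior-based pruning)."""
    return [i for i, post in enumerate(posteriors) if post >= threshold]

# Toy frame with three phones (hypothetical values).
frame_posteriors = [0.7, 0.2, 0.1]
phone_priors = [0.5, 0.25, 0.25]
scores = scaled_log_likelihoods(frame_posteriors, phone_priors)
kept = active_phones(frame_posteriors, threshold=0.15)
```

In this toy frame only the first two phones would remain active, so the search would not need to extend hypotheses through the third.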
Abbot has been trained on British and American English broadcast speech, and a similar system for French is under development. The American system is trained on 100 hours of broadcast speech data; the British English system is trained on around 30 hours of transcribed BBC television and radio news, collected as part of this project. Both systems use language models collected from text databases containing several hundred million words, obtained from broadcast transcriptions and newspaper/newswire text.
The same recognizer is used for both broadcast speech recognition and the spoken query interface. The spoken query recognizer must be able to recognize both a ``meta-language'' (eg ``I am looking for a clip about...'', ``Have you got anything on...?'') with a vocabulary of a few hundred words and the ``keywords'' used for retrieval. But since we are performing full text retrieval, the set of ``keywords'' is simply the vocabulary of the broadcast news recognizer. The language model appropriate to spoken queries is, of course, different; however, our current strategy is to process all speech with the same language model, postprocessing spoken query word-graphs with an NLP backend.
Acoustic segmentation alone is not adequate to define ``documents'' for the IR component. The simplest approaches to text segmentation that we are investigating define a document as a window of n words, while more sophisticated approaches build statistical topic models of text.
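The simplest of these approaches can be sketched as follows. This is a minimal illustration assuming the recognizer output is available as a flat list of words. The optional `step` parameter, which lets consecutive windows overlap, is an assumption added for illustration and is not described above.

```python
def window_documents(words, n, step=None):
    """Split a word sequence into "documents" of at most n words.
    step is the offset between window starts; it defaults to n
    (non-overlapping windows). A smaller step gives overlapping windows."""
    if n <= 0:
        raise ValueError("window length n must be positive")
    if step is None:
        step = n
    if step <= 0:
        raise ValueError("step must be positive")
    documents = []
    for start in range(0, len(words), step):
        documents.append(words[start:start + n])
        if start + n >= len(words):
            break  # the last window already reaches the end of the stream
    return documents

transcript = "the prime minister said today that the budget would rise".split()
docs = window_documents(transcript, n=4)
overlapping = window_documents(transcript, n=4, step=2)
```

Each resulting window is then indexed by the IR component as if it were a separate document.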
The full-text retrieval approach adopted in the THISL system is based on a probabilistic model of information retrieval, in which (loosely speaking) the match between a query and a document is based on a count of matching words (or terms). Stop words (eg ``the'', ``to''), which do not aid discrimination between relevant and non-relevant documents, are not considered, and a stemming algorithm is used to transform each word into its stem. A simple count is too crude, so term weighting is employed, which includes additional factors relating to the document length and the number of documents in which a term occurs. We use a term weighting scheme similar to that which has been developed successfully by the Okapi group at City University.
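As an illustration of this kind of weighting, the sketch below scores documents with the widely used BM25 form of Okapi weighting. It is not the exact THISL weighting function: the parameter values `k1` and `b`, the tiny stop list and the identity "stemmer" are placeholder assumptions (a real system would use a full stop list and a proper stemming algorithm).

```python
import math
from collections import Counter

STOP_WORDS = {"the", "to", "a", "of", "in", "and"}  # illustrative stop list only

def preprocess(text, stem=lambda word: word):
    """Lower-case, drop stop words, and stem (identity stemmer by default)."""
    return [stem(w) for w in text.lower().split() if w not in STOP_WORDS]

def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Score each document against the query with a BM25-style weight that
    combines term frequency, document length and document frequency."""
    if not documents:
        return []
    doc_counts = [Counter(preprocess(d)) for d in documents]
    lengths = [sum(c.values()) for c in doc_counts]
    n_docs = len(doc_counts)
    avg_len = (sum(lengths) / n_docs) or 1.0  # guard against all-empty documents
    query_terms = set(preprocess(query))
    scores = []
    for counts, length in zip(doc_counts, lengths):
        score = 0.0
        for term in query_terms:
            tf = counts[term]
            if tf == 0:
                continue
            df = sum(1 for c in doc_counts if term in c)
            # Rarer terms receive a larger collection weight.
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
            # Term frequency is damped and normalized by document length.
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * length / avg_len))
        scores.append(score)
    return scores

archive = [
    "the minister spoke to parliament",
    "football results and weather",
    "parliament debated the budget",
]
ranking_scores = bm25_scores("parliament budget", archive)
```

Here the third document matches both query terms and scores highest, while the second matches neither and scores zero.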
In the speech retrieval case the problem is complicated by the fact that the document archive is produced by a speech recognizer. In the case of broadcast news, this means that word error rate is likely to be in the region of 15-40%. An important aim of the THISL project is to investigate the effect of speech recognizer word error rate on retrieval performance (see section 5), and to develop IR algorithms that take into account such uncertainty. Our approaches have been to add terms to the document through the use of lattices or n-best lists, to investigate the use of acoustic confidence measures in the IR document-term weighting function, and to investigate the use of query expansion as a way of adding robustness to possible speech recognition errors. In particular, query expansion makes use of a secondary corpus (in our work this is typically a corpus of newswire text from a similar time period to the primary broadcast corpus). The query is expanded by assuming that the top n documents retrieved from the secondary corpus are relevant, and using a weighted mutual information metric (or similar) to find those terms that are candidates to expand the query. The advantage of this approach is that it enables written text to be used to produce an expanded query that is then applied to the broadcast archive.
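The query expansion step can be sketched as below. The exact metric is not specified above, so this sketch assumes one simple mutual-information-style score: each candidate term is weighted by the number of top-ranked documents containing it, times the log ratio of its rate in the top documents to its rate in the whole secondary corpus. The function name, parameters and toy corpus are hypothetical.

```python
import math

def expand_query(query_terms, ranked_docs, top_n=2, n_new_terms=3):
    """Add expansion terms to a query.
    ranked_docs: token lists from the secondary corpus, ordered best-first
    by a first-pass retrieval with the original query. The top_n documents
    are assumed to be relevant."""
    doc_sets = [set(doc) for doc in ranked_docs]
    top_sets = doc_sets[:top_n]
    if not top_sets:
        return list(query_terms)
    n_docs = len(doc_sets)
    candidates = set().union(*top_sets) - set(query_terms)
    scored = []
    for term in candidates:
        r = sum(term in d for d in top_sets)   # top documents containing term
        df = sum(term in d for d in doc_sets)  # all documents containing term
        # Assumed metric: r-weighted log ratio of in-top rate to corpus rate.
        score = r * math.log((r / len(top_sets)) / (df / n_docs))
        scored.append((score, term))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    new_terms = [term for score, term in scored[:n_new_terms] if score > 0]
    return list(query_terms) + new_terms

newswire = [
    ["election", "vote", "minister"],
    ["election", "vote", "poll"],
    ["football", "vote", "goal"],
    ["weather", "rain", "vote"],
]
expanded = expand_query(["election"], newswire, top_n=2, n_new_terms=2)
```

In this toy corpus "vote" occurs in every document and so receives a score of zero, while "minister" and "poll" are concentrated in the top-ranked documents and are added to the query. The expanded query would then be run against the broadcast archive.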
The information retrieval component can use both a typed and a spoken query interface. In addition to the obvious advantages of a speech-based interface, the natural language processing component would enable more sophisticated query processing (eg determining whether a word is being used as a meta-word or as a keyword).
Evaluation of IR systems is important, but requires considerable infrastructure. Developing the North American system has enabled us to evaluate our technology within the NIST TREC (Text Retrieval Conference) programme of evaluations. In particular, the THISL system has participated in the TREC-6 and TREC-7 spoken document retrieval evaluations. The TREC-7 evaluation was based on an archive of 100 hours of North American broadcast news. Using an Ultra 1/167 we were able to decode this speech data at around eight times realtime, with an estimated relative search error of 10-20%, resulting in an overall word error rate (WER) of about 35%. The interpolated recall-precision curve for the THISL system in this evaluation is shown in figure 2. This evaluation enabled cross-recognizer comparisons to assess the effect of different recognizer performance, and these results are also shown in figure 2.
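For readers unfamiliar with interpolated recall-precision curves, the following sketch shows how one can be computed for a single query. It assumes the usual convention that interpolated precision at a recall level is the maximum precision observed at any recall greater than or equal to that level, evaluated at eleven standard recall levels. The function name and the toy ranking are illustrative and are not drawn from the TREC results.

```python
def interpolated_precision(ranked_relevance, n_relevant, levels=None):
    """Interpolated precision at standard recall levels for one query.
    ranked_relevance: booleans for the ranked list, best-first
                      (True means the document is relevant).
    n_relevant: total number of relevant documents for the query."""
    if n_relevant <= 0:
        raise ValueError("n_relevant must be positive")
    if levels is None:
        levels = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
    points = []  # (recall, precision) measured at each relevant document
    hits = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            points.append((hits / n_relevant, hits / rank))
    curve = []
    for level in levels:
        # Small tolerance guards against floating-point rounding in recall.
        reachable = [p for r, p in points if r >= level - 1e-12]
        curve.append(max(reachable) if reachable else 0.0)
    return curve

# Toy ranked list: relevant documents at ranks 1 and 3, two relevant in total.
curve = interpolated_precision([True, False, True, False], n_relevant=2)
```

Averaging such curves over all queries in an evaluation gives a single recall-precision curve per system.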