The THISL Spoken Document Retrieval Project

Contact: Steve Renals, Department of Computer Science, University of Sheffield,
Sheffield S1 4DP, UK (s.renals@dcs.shef.ac.uk)

Abstract:

THISL is an ESPRIT Long Term Research Project focused on the retrieval of multimedia information (primarily written or spoken text) using a spoken language interface. The project is concerned with the construction of a demonstration system which performs accurate recognition of broadcast speech from TV and radio news programmes and produces multimedia indexing data from the resulting transcriptions. The project concentrates on British and American English applications, with work in progress on a French speech recognition application. At the midway point of the project, we have constructed a prototype system based on an archive of 100 hours of North American broadcast news. This has been successfully evaluated within the framework of the TREC-6 and TREC-7 spoken document retrieval tracks. By early 1999, we expect to have a second prototype system based on an archive of several hundred hours of BBC news output.

Participants

THISL is a three-year project that began in February 1997. The project partners are:

   
Aims and Objectives

The THISL project is concerned with the integration of current state-of-the-art Large Vocabulary Continuous Speech Recognition (LVCSR), Information Retrieval (IR) and Natural Language Processing (NLP) technologies, focused on a target application: the automatic indexing and retrieval of broadcast television and radio news programmes.

The project builds on the LVCSR technology developed by some of the THISL partners in the ESPRIT Long Term Research projects WERNICKE and SPRACH, the NLP technology developed by Thomson, and modern text retrieval technology. The principal objectives are:

Specifically, the project is concerned with the construction of a demonstration system which performs accurate recognition of broadcast news speech and produces multimedia indexing data from the transcriptions. The project is concentrating on British and American English applications, with work in progress on a French language system. The resulting system may be regarded as a ``news-on-demand'' application, in which specific portions of a broadcast are retrieved in response to a spoken request from the user.

Following discussions with BBC library staff, three application areas have emerged: archives, newsroom systems and programme workgroups. The newsroom application has become the main target area: its throughput is more manageable (several hours of broadcast news audio per day), although it is more time-constrained than the other two applications.

The principal expected result of this project is a demonstration that LVCSR can be used as the basis for a broadcast news-on-demand system. Experiments carried out during the first half of the project, primarily in the framework of the ARPA/NIST TREC Spoken Document Retrieval Evaluation programme [1], indicate that this is feasible even at current levels of speech recognition performance.

Several innovative intermediate results will be required to achieve this goal, including: the transcription of broadcast speech; the development of audio-editing tools such as hesitation detection and audio-track keyword search; content-based retrieval from audio and video archives; and a robust spoken language interface for the search and retrieval of audio and video data.

   
Approach

The basic approach that we have adopted is to use the Abbot LVCSR system [2] to produce approximate transcriptions of the audio documents, and then to treat the task as a text retrieval problem, relying on well-understood techniques to perform indexing and retrieval of the transcribed data. The primary difference from standard text retrieval is that the documents in the archive may have word error rates in excess of 30%; a key feature of any speech retrieval system is therefore its robustness to speech recognition errors.
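
To make this concrete, the core indexing step can be sketched as building a standard inverted index over the recognizer's one-best transcriptions. The following is a minimal Python sketch, not the project's actual implementation; the document identifiers and stop word list are illustrative:

    from collections import defaultdict

    # Hypothetical one-best transcriptions, one per audio document
    # (in the real system these come from the ``srt'' files).
    transcripts = {
        "news_a": "the prime minister said today that ...",
        "news_b": "floods have forced thousands from their homes ...",
    }

    STOP_WORDS = {"the", "that", "have", "from", "their"}

    def build_index(docs):
        """Build a term -> {doc_id: term frequency} inverted index."""
        index = defaultdict(lambda: defaultdict(int))
        for doc_id, text in docs.items():
            for word in text.lower().split():
                if word not in STOP_WORDS:
                    index[word][doc_id] += 1
        return index

    index = build_index(transcripts)
    print(dict(index["minister"]))   # {'news_a': 1}

Retrieval then reduces to scoring documents by the weighted overlap between query terms and index entries, exactly as in text retrieval; recognition errors appear as missing or spurious index terms.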

In addition to the usual keyboard/mouse interface, we are developing a spoken query interface to the THISL system, in which the user may interact verbally with the system, to develop more refined queries.

These components will be brought together in a demonstration system using BBC news output, collected over the duration of the project. The overall architecture of the system is illustrated in figure 1.


  
Figure 1: Architecture of the THISL system for indexing and retrieval of broadcast news. ``srt'' files are speech recognition transcriptions; ``lna'' files are the phone probability files output by the recognizer.

   
Speech Recognition

   
The Abbot System

The Abbot LVCSR system [2] uses a hybrid artificial neural network/hidden Markov model (ANN/HMM) acoustic model and a backed-off trigram language model. At recognition time the recognizer uses a vocabulary of around 65,000 words. On a modern PC it runs at two to three times real time, producing a single best transcription, a word graph (containing other possible hypotheses) and word- and phone-level confidence measures.
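
As a simplified illustration of backed-off n-gram scoring, the toy model below falls back from trigram to bigram to unigram estimates with a fixed penalty. This is a deliberately crude sketch: a real backed-off trigram model, such as the one used by Abbot, estimates proper discounts and backoff weights from the training counts.

    from collections import Counter

    class BackoffTrigramLM:
        """Toy backed-off trigram LM: use the maximum-likelihood trigram
        estimate when the trigram was seen in training, otherwise back
        off (with a fixed penalty) to the bigram, then the unigram."""

        def __init__(self, sentences, backoff_penalty=0.4):
            self.penalty = backoff_penalty
            self.uni, self.bi, self.tri = Counter(), Counter(), Counter()
            for words in sentences:
                self.uni.update(words)
                self.bi.update(zip(words, words[1:]))
                self.tri.update(zip(words, words[1:], words[2:]))
            self.total = sum(self.uni.values())

        def prob(self, w1, w2, w3):
            """P(w3 | w1, w2) with fixed-penalty backoff."""
            if self.tri[(w1, w2, w3)] > 0:
                return self.tri[(w1, w2, w3)] / self.bi[(w1, w2)]
            if self.bi[(w2, w3)] > 0:
                return self.penalty * self.bi[(w2, w3)] / self.uni[w2]
            return self.penalty ** 2 * self.uni[w3] / self.total

    lm = BackoffTrigramLM([["the", "prime", "minister", "said"],
                           ["the", "minister", "said", "today"]])
    print(lm.prob("the", "prime", "minister"))   # 1.0 (trigram observed)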

The Abbot acoustic model is based on a recurrent network [3] that is used to estimate the local posterior probability of each phone given the acoustic data. This approach, which differs from the likelihood estimation of traditional HMMs, has a number of benefits, including the use of posteriors to help prune the search space [4] and more compact acoustic models.
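
In such hybrid systems the network posteriors p(q|x) are commonly converted into scaled likelihoods p(q|x)/p(q) by dividing by the phone priors, since the HMM decoder expects likelihood-like scores; the p(x) term is constant for a given frame and can be ignored. A minimal numpy sketch (the array values are illustrative):

    import numpy as np

    def posteriors_to_scaled_likelihoods(posteriors, priors, floor=1e-10):
        """Convert per-frame phone posteriors p(q|x) into log scaled
        likelihoods log(p(q|x)/p(q)) for use in HMM decoding."""
        posteriors = np.maximum(posteriors, floor)   # avoid log(0)
        return np.log(posteriors) - np.log(priors)

    # Two frames, three phone classes; each row sums to 1.
    posteriors = np.array([[0.7, 0.2, 0.1],
                           [0.1, 0.8, 0.1]])
    priors = np.array([0.5, 0.3, 0.2])   # relative phone frequencies in training

    log_scaled = posteriors_to_scaled_likelihoods(posteriors, priors)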

Abbot has been trained on British and American English broadcast speech, and a similar system for French is under development. The American system is trained on 100 hours of broadcast speech data; the British English system is trained on around 30 hours of transcribed BBC television and radio news, collected as part of this project. Both systems use language models collected from text databases containing several hundred million words, obtained from broadcast transcriptions and newspaper/newswire text.

The same recognizer is used both for broadcast speech recognition and in the spoken query interface. The spoken query recognizer must be able to recognize both a ``meta-language'' (e.g. ``I am looking for a clip about...'', ``Have you got anything on...?'') with a vocabulary of a few hundred words, and the ``keywords'' used for retrieval. Since we are performing full-text retrieval, however, the set of ``keywords'' is simply the vocabulary of the broadcast news recognizer. The language model is, of course, different when processing spoken queries; our current strategy, however, is to process all speech with the same language model and to postprocess spoken query word graphs with an NLP backend.

   
Confidence Measures and Rejection

A confidence measure on the output of a speech recognizer quantifies how likely it is that a given hypothesis is correct, given information concerning that hypothesis. Confidence measures are important for the THISL application [5]: word-level confidence measures can be used to weight the words in the indexed documents, in the spoken query, or in both; they can assist in the backend processing of the spoken query; they can be used to segment an incoming audio stream; and they can be used for the ``cleaning'' of training data.
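
As a simple illustration (one of many possible measures, not necessarily those developed in [5]), a word-level acoustic confidence can be computed as the duration-normalised average of the per-frame log posteriors of the phones aligned to the word, with hypotheses below a threshold rejected:

    import numpy as np

    def word_confidence(frame_log_posteriors):
        """Duration-normalised acoustic confidence for one word
        hypothesis: the mean log posterior, over the frames aligned
        to the word, of the phones chosen by the decoder."""
        return float(np.mean(frame_log_posteriors))

    def accept(frame_log_posteriors, threshold=-0.5):
        """Reject a hypothesis whose confidence falls below a
        threshold (the threshold value here is illustrative)."""
        return word_confidence(frame_log_posteriors) >= threshold

    print(accept(np.log([0.9, 0.8, 0.85, 0.9])))   # True: confident hypothesis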

   
Segmentation and Speaker Tracking

Broadcast audio is not presegmented into utterances, stories, documents or speakers, so automatic segmentation is important for both the speech recognition and text retrieval parts of the system. Our approach to acoustic segmentation, which can be used to aid the recognizer by dividing the incoming audio stream into ``recognizable'' and ``non-recognizable'' portions, is based both on Gaussian mixture models of acoustic segments and on recognizer confidence measures [6]. Unsupervised speaker tracking involves segmenting a broadcast into speaker turns without prior knowledge of speaker identities or the number of speakers; our approach involves clustering acoustically similar speech segments.
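
A common criterion for this kind of clustering is the Bayesian Information Criterion (BIC) test on Gaussian segment models: two segments are merged when a single Gaussian explains their frames nearly as well as two separate Gaussians. The sketch below is a standard heuristic, not necessarily the exact THISL criterion:

    import numpy as np

    def delta_bic(x, y, lam=1.0):
        """BIC change test between two acoustic segments x and y
        (rows = frames, cols = features), each modelled as a single
        full-covariance Gaussian. A positive value suggests two
        different speakers; a value <= 0 suggests the segments can
        be merged into one cluster."""
        z = np.vstack([x, y])
        n_x, n_y, n_z = len(x), len(y), len(z)
        d = z.shape[1]
        logdet = lambda a: np.linalg.slogdet(np.cov(a, rowvar=False))[1]
        r = 0.5 * (n_z * logdet(z) - n_x * logdet(x) - n_y * logdet(y))
        n_params = d + d * (d + 1) / 2    # Gaussian mean + covariance
        return r - 0.5 * lam * n_params * np.log(n_z)

    rng = np.random.default_rng(0)
    a = rng.normal(0.0, 1.0, size=(200, 12))   # segment from one "speaker"
    b = rng.normal(2.0, 1.0, size=(200, 12))   # segment from another
    print(delta_bic(a, b) > 0)                 # True: keep separate

Agglomerative clustering then repeatedly merges the pair of clusters with the lowest delta_bic until no pair scores at or below zero.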

Acoustic segmentation alone is not adequate to define ``documents'' for the IR component. The simplest approaches to text segmentation that we are investigating define a document as a window of n words, while more sophisticated approaches build statistical topic models of text.
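
A minimal sketch of the windowing approach (the window length and overlap are illustrative, not the project's tuned values):

    def window_documents(words, size=200, overlap=100):
        """Split a transcript (a list of words) into overlapping windows
        of `size` words, stepping by `size - overlap` words; each window
        is indexed as a separate pseudo-document."""
        step = size - overlap
        return [words[i:i + size]
                for i in range(0, max(len(words) - overlap, 1), step)]

    # A 450-word transcript yields pseudo-documents starting at
    # words 0, 100, 200 and 300.
    docs = window_documents(["w%d" % i for i in range(450)])

Overlapping windows reduce the chance that the words relevant to a query are split across a document boundary, at the cost of some duplication in the index.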

   
Information Retrieval

In an Information Retrieval system a user has an information need, which is expressed as a textual (or spoken) query. The system's task is to return a ranked list of documents (drawn from an archive) that are best matched to that information need. The process may be iterated, in a loop in which the user indicates which documents are relevant. This approach -- termed relevance feedback -- can be used to refine or expand the initial query for further searches.

The full-text retrieval approach adopted in the THISL system is based on a probabilistic model of information retrieval, in which (loosely speaking) the match between a query and a document is based on a count of matching words (or terms). Stop words (e.g. ``the'', ``to''), which do not help discriminate between relevant and non-relevant documents, are not considered, and a stemming algorithm is used to reduce each word to its stem. A simple count is too crude, so term weighting is employed, taking into account the document length and the number of documents in which each term occurs. We use a term weighting scheme similar to that developed successfully by the Okapi group at City University [7].
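
A sketch of such a combined weight, in the style of the Okapi scheme described in [7], is given below; the tuning constants k1 and b are typical values, not necessarily those used in THISL:

    import math

    def combined_weight(tf, n_t, N, dl, avg_dl, k1=1.2, b=0.75):
        """Okapi-style combined term weight: the collection frequency
        weight log(N/n_t) scaled by a saturating, length-normalised
        term frequency component.
          tf     -- occurrences of the term in the document
          n_t    -- number of documents containing the term
          N      -- number of documents in the collection
          dl     -- document length (avg_dl: average document length)"""
        cfw = math.log(N / n_t)
        ndl = dl / avg_dl
        return cfw * tf * (k1 + 1) / (k1 * ((1 - b) + b * ndl) + tf)

    # One query term occurring twice in a 300-word document, found in
    # 50 of 10,000 documents whose average length is 250 words; a
    # document's score is the sum of such weights over the query terms.
    print(combined_weight(tf=2, n_t=50, N=10000, dl=300, avg_dl=250))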

In the speech retrieval case the problem is complicated by the fact that the document archive is produced by a speech recognizer; for broadcast news, this means that the word error rate is likely to be in the region of 15-40%. An important aim of the THISL project is to investigate the effect of speech recognizer word error rate on retrieval performance (see section 5), and to develop IR algorithms that take such uncertainty into account. Our approaches have been to add terms to the document through the use of lattices or n-best lists; to investigate the use of acoustic confidence measures in the IR document-term weighting function; and to investigate query expansion as a way of adding robustness to speech recognition errors. In particular, query expansion [8] makes use of a secondary corpus (in our work, typically a corpus of newswire text from a similar time period to the primary broadcast corpus). The query is expanded by assuming that the top n documents retrieved from the secondary corpus are relevant, and using a weighted mutual information metric (or similar) to find terms that are candidates to expand the query. The advantage of this approach is that it enables written text to be used to produce an expanded query that is then applied to the broadcast archive.
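
The expansion step can be sketched as follows; for brevity this uses plain term frequency in the pseudo-relevant documents rather than the weighted mutual information metric mentioned above, and all names are illustrative:

    from collections import Counter

    def expand_query(query_terms, ranked_docs, n_docs=10, n_terms=5):
        """Blind relevance feedback: assume the top `n_docs` documents
        retrieved from the secondary corpus are relevant, score each
        candidate term by its frequency in those documents, and append
        the best-scoring new terms to the query."""
        pseudo_relevant = ranked_docs[:n_docs]
        counts = Counter(term
                         for doc in pseudo_relevant   # doc = list of terms
                         for term in doc
                         if term not in query_terms)
        return list(query_terms) + [t for t, _ in counts.most_common(n_terms)]

The expanded query is then run against the errorful broadcast archive, so a document in which the recognizer misrecognized an original query term may still be matched through the expansion terms.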

The information retrieval component can use either a typed or a spoken query interface. In addition to the obvious advantages of a speech-based interface, the natural language processing component would enable more sophisticated query processing (e.g. determining whether a word is being used as a meta-word or as a keyword).

   
Demonstration System

We have currently implemented a prototype demonstration system using an archive of 100 hours of North American broadcast news; a second prototype, using an archive of several hundred hours of BBC news data, will be implemented by early 1999.

Evaluation of IR systems is important, but requires considerable infrastructure. Developing the North American system has enabled us to evaluate our technology within the NIST TREC (Text Retrieval Conference) programme of evaluations [9]. In particular, the THISL system has participated in the TREC-6 and TREC-7 spoken document retrieval evaluations. The TREC-7 evaluation was based on an archive of 100 hours of North American broadcast news. Using an Ultra 1/167 we were able to decode this speech data at around eight times real time, with an estimated relative search error of 10-20%, resulting in an overall word error rate (WER) of about 35%. The interpolated recall-precision curve for the THISL system in this evaluation is shown in figure 2. This evaluation enabled cross-recognizer comparisons to assess the effect of different recognizer performance; these results are also shown in figure 2.


  
Figure 2: Interpolated recall-precision of the THISL system in the TREC-7 Spoken Document Retrieval evaluation. Five recognition conditions are plotted, all using the THISL IR system: R1 uses the reference transcripts (the control case); S1 uses the THISL speech recognition system (35% word error rate, WER); B1 uses the medium-error baseline recognizer supplied by NIST (35% WER); B2 uses the high-error baseline recognizer supplied by NIST (48% WER); and CR-CUHTK uses the Cambridge University HTK speech recognition system (25% WER).
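
For reference, interpolated precision at the standard recall points is computed by taking, at each recall level, the maximum precision achieved at any equal or higher recall. A minimal sketch with illustrative data:

    def interpolated_precision(ranking, relevant, levels=11):
        """TREC-style interpolated precision at `levels` standard recall
        points (0.0, 0.1, ..., 1.0): the value at recall r is the
        maximum precision at any recall >= r."""
        rel_found, points = 0, []
        for rank, doc in enumerate(ranking, start=1):
            if doc in relevant:
                rel_found += 1
                points.append((rel_found / len(relevant), rel_found / rank))
        return [max([p for rec, p in points if rec >= i / (levels - 1)],
                    default=0.0)
                for i in range(levels)]

    print(interpolated_precision(["d3", "d1", "d7", "d2"], {"d1", "d2"}))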

Bibliography

1
E. M. Voorhees and D. K. Harman, eds., The Sixth Text REtrieval Conference (TREC-6), no. 500-240 in NIST Special Publication, 1998.

2
T. Robinson, M. Hochberg, and S. Renals, ``The use of recurrent networks in continuous speech recognition,'' in Automatic Speech and Speaker Recognition - Advanced Topics (C. H. Lee, K. K. Paliwal, and F. K. Soong, eds.), ch. 10, pp. 233-258, Kluwer Academic Publishers, 1996.

3
A. J. Robinson, ``The application of recurrent nets to phone probability estimation,'' IEEE Trans. Neural Networks, vol. 5, pp. 298-305, 1994.

4
S. Renals and M. Hochberg, ``Start-synchronous search for large vocabulary continuous speech recognition,'' IEEE Trans. Speech and Audio Processing, in press.

5
G. Williams and S. Renals, ``Confidence measures derived from an Acceptor HMM,'' in Proc. Int. Conf. Spoken Language Processing, 1998.

6
J. Barker, G. Williams, and S. Renals, ``Acoustic confidence measures for segmenting broadcast news,'' in Proc. Int. Conf. Spoken Language Processing, 1998.

7
S. E. Robertson and K. Spärck Jones, ``Simple, proven approaches to text retrieval,'' Tech. Rep. TR356, Cambridge University Computer Laboratory, 1997.

8
J. Xu and W. B. Croft, ``Query expansion using local and global document analysis,'' in Proc. ACM SIGIR, 1996.

9
D. Abberley, S. Renals, and G. Cook, ``Retrieval of broadcast news documents with the THISL system,'' in Proc. Int. Conf. Acoustics, Speech and Signal Processing, (Seattle), pp. 3781-3784, 1998.

