Accessing Information in Spoken Audio
Speech is more than just a way of inputting commands and text into a computer. It is an information-rich medium in itself, capable of carrying expressions of thought, ideas, feelings and emphasis that would be lost in even the most accurate transcription. It is now easy to store hours, or even years, of speech; but how can we access the stored information easily and quickly?
The objective of this research is to use statistical models to access information in audio. Much of our work has concentrated on accessing information in broadcast speech, although some ongoing research is concerned with telephone voicemail data. Tasks we have looked at include spoken document retrieval (indexing, search and retrieval of broadcast data), automatic punctuation annotation and the automatic identification of named entities.

Spoken Document Retrieval
Our work in spoken document retrieval has been carried out largely within the framework of the THISL project. The aim of this project was to produce a broadcast news retrieval demonstrator for the BBC. The approach adopted was to transcribe radio and television broadcasts using the Abbot speech recognizer, and then to index the resulting transcriptions using the thislIR information retrieval system - similar to a web search engine - which allows users to search for news items of interest to them. thislIR returns a list of the news clips most relevant to each query, which users can then listen to. Demonstrators have been produced with both text and spoken query interfaces, and the news retrieval demonstrator is currently being evaluated by the BBC.
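The transcribe-then-index pipeline described above can be illustrated with a simple term-weighting retrieval model. This is a toy sketch of the general idea, not the thislIR implementation; the clip transcripts, query and tf-idf scoring are invented for illustration.

```python
import math
from collections import Counter

def build_index(transcripts):
    """Build an inverted index (term -> set of clip ids) from transcripts."""
    index = {}
    for doc_id, text in transcripts.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index

def rank(query, transcripts, index):
    """Rank clips by a simple tf-idf score for the query terms."""
    n_docs = len(transcripts)
    scores = Counter()
    for doc_id, text in transcripts.items():
        tf = Counter(text.lower().split())
        for term in query.lower().split():
            df = len(index.get(term, ()))
            if df:
                # rarer terms get higher weight via inverse document frequency
                scores[doc_id] += tf[term] * math.log(n_docs / df)
    return scores.most_common()

transcripts = {
    "clip1": "election results announced in parliament today",
    "clip2": "football match ends in a draw",
    "clip3": "parliament debates the election reform bill after the election",
}
index = build_index(transcripts)
ranking = rank("election parliament", transcripts, index)
```

With recognizer transcripts in place of these toy strings, the same search-engine machinery applies directly to audio; only the source of the text changes.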
Spoken document retrieval differs from text or web-page retrieval in several ways. There is no pre-existing notion of a "document": audio is received as an uninterrupted stream. While it is straightforward to locate individual programmes, it is much more difficult to automatically find segments within a programme, for example the individual stories in a news programme. Indeed, one news story may span several acoustic conditions (scripted newsreader, interview, outside broadcast, etc.), or one acoustic condition may span several stories (e.g. a single newsreader). There will also inevitably be speech recognition errors, together with sections of non-speech audio (music, crowd noise, etc.). To deal with these problems we have investigated automatic segmentation methods and speech recognition confidence measures. Finally, speech recognition systems have a large but finite vocabulary, so there is a possibility that some words (e.g. new names in the news) cannot be recognized. We have developed statistical query expansion algorithms for spoken document retrieval that add related words to the query. This improves overall accuracy and can compensate to some extent for the occasional problem of out-of-vocabulary words.
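One simple form of statistical query expansion is blind relevance feedback: run the original query, then add the most frequent content terms from the top-ranked documents back into the query. The sketch below illustrates only this general flavour, not the actual THISL algorithm; the stopword list, ranked documents and parameters are assumptions.

```python
from collections import Counter

STOPWORDS = {"the", "a", "in", "of", "and", "to", "over"}

def expand_query(query, ranked_docs, n_docs=2, n_terms=3):
    """Blind relevance feedback: add frequent terms from top-ranked documents."""
    query_terms = query.lower().split()
    counts = Counter()
    for text in ranked_docs[:n_docs]:
        for term in text.lower().split():
            if term not in STOPWORDS and term not in query_terms:
                counts[term] += 1
    expansion = [t for t, _ in counts.most_common(n_terms)]
    return query_terms + expansion

ranked = [
    "minister resigns amid election funding scandal",
    "election funding inquiry widens",
]
expanded = expand_query("election", ranked)
```

Even if an out-of-vocabulary name never appears in the recognizer output, related in-vocabulary words added this way can still retrieve the relevant clips.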
We have been able to evaluate our spoken document retrieval systems as part of TREC (the Text Retrieval Conference), which has run a spoken document retrieval track since 1997. We have participated in this track each year, with very good results. The results of these evaluations, which used North American broadcast news data, were interesting - and surprising. For this task, with a database of about 550 hours of audio, there was very little difference between systems based on speech recognition and systems based on human-generated reference transcripts. There was also no correlation between speech recognizer word error rate and the overall precision and recall of the system (for systems with word error rates ranging from 20% to 32%).

Structured Transcriptions and Information Extraction
Although large vocabulary continuous speech recognition systems are now available on the high street, it is apparent that beyond the controlled environment of dictation (noise-free audio, a single cooperative speaker, often a narrow task domain) even the best research systems have an unacceptably low level of performance. The best systems on broadcast news have a word error rate of 10--50%, depending on the condition; transcription of spontaneous telephone speech has word error rates of 30--60%. Further, discussions with users and potential users of spoken language technology make it apparent that while some of the dissatisfaction with current research systems is due to the high word error rate, the unstructured nature of the recognizer output - for example the lack of punctuation and capitalization - also limits its usefulness. In this strand of research we have addressed these problems by developing statistical methods to structure the output of a large vocabulary speech recognizer. Some of this work was performed as part of the SToBS project.
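The word error rates quoted above are computed as the word-level edit distance between the recognizer output and a reference transcript, normalized by the reference length. A minimal sketch (the example sentences are invented):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference length,
    via dynamic-programming edit distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j          # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# two substitutions against a six-word reference -> WER of 2/6
wer = word_error_rate("the cat sat on the mat", "the cat sat on a hat")
```

Note that because insertions count as errors, WER can exceed 100% for very noisy conditions.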
We have investigated two approaches to structuring the output of a speech recognizer: automatic punctuation and the identification of named entities (people, places, organizations, etc.). For both tasks we used a statistical finite-state architecture which is trainable from annotated text. These approaches work well - considering the simplicity of the underlying model - and we achieved the best results in the most recent (1999) evaluation of named entity identification in broadcast speech.
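A trainable finite-state tagger of this general flavour can be sketched as a hidden Markov model over word/tag pairs, decoded with the Viterbi algorithm. This toy illustrates only the architecture; the tiny training data, the PER/LOC/O tag set and the add-alpha smoothing are invented, not our evaluated system.

```python
import math
from collections import Counter

def train(tagged_sentences):
    """Count tag transitions and word emissions from annotated text."""
    trans, emit, tag_totals, tags = Counter(), Counter(), Counter(), set()
    for sent in tagged_sentences:
        prev = "<s>"
        for word, tag in sent:
            trans[(prev, tag)] += 1
            emit[(tag, word.lower())] += 1
            tag_totals[tag] += 1
            tags.add(tag)
            prev = tag
    return trans, emit, tag_totals, tags

def viterbi(words, trans, emit, tag_totals, tags, alpha=0.5):
    """Most likely tag sequence under add-alpha smoothed probabilities."""
    def logp(count, total, size):
        return math.log((count + alpha) / (total + alpha * size))
    n_tags = len(tags)
    vocab = len({w for (_, w) in emit})
    total = sum(tag_totals.values())
    best = {t: (logp(trans[("<s>", t)], total, n_tags)
                + logp(emit[(t, words[0].lower())], tag_totals[t], vocab), [t])
            for t in tags}
    for w in words[1:]:
        new = {}
        for t in tags:
            e = logp(emit[(t, w.lower())], tag_totals[t], vocab)
            score, prev_t = max(
                (best[p][0] + logp(trans[(p, t)], tag_totals[p], n_tags) + e, p)
                for p in tags)
            new[t] = (score, best[prev_t][1] + [t])
        best = new
    return max(best.values())[1]

train_data = [
    [("John", "PER"), ("visited", "O"), ("London", "LOC")],
    [("Mary", "PER"), ("works", "O"), ("in", "O"), ("Paris", "LOC")],
]
trans, emit, tag_totals, tags = train(train_data)
tagged = viterbi(["John", "visited", "Paris"], trans, emit, tag_totals, tags)
```

The same machinery, trained on punctuated text with word/punctuation-state pairs, applies to automatic punctuation as well.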
Another strand of this work is concerned with developing summaries of telephone voicemail data. Aside from being a challenging scientific problem, this work has potential applications in areas such as text notification of voicemail messages.

Current and Future Work
At the moment we are particularly interested in the following tasks:
We are also excited about several (related) statistical frameworks: