SToBS - Structured Transcription of Broadcast Speech
Supported by EPSRC, 15 December 1998 to 14 December 2001.
Researcher: Heidi Christensen
Although large vocabulary continuous speech recognition systems are now available on the high street, it is apparent that beyond the controlled environment of dictation (noise-free, single cooperative speaker, often a narrow task domain) even the best research systems have an unacceptably low level of performance. The best systems on broadcast news have a word error rate of 10--50%, depending on the condition; transcription of spontaneous telephone speech has word error rates of 30--60%. Further, discussions with users and potential users of spoken language technology (for example, the industrial advisory board of the SPRACH project) indicate that while some of the dissatisfaction with current research systems is due to a high word error rate, the unstructured nature of the recognizer output, for example the lack of punctuation and capitalization, also limits its usefulness. In this project we plan to address both problems through the development of improved statistical language models and the integration of techniques arising from modern NLP systems.
It is not an exaggeration to say that since the earliest systems produced by IBM, all successful large vocabulary speech recognition systems have been based on n-gram language models. There have been many efforts to provide enhanced language models, based on both richer statistical models and linguistic models: however, these have been notable for their failure to surpass the performance of crude n-gram models. A clear lesson from the past fifteen years is that a naive application of theoretically well-motivated richer models of language has a very low probability of improving a state-of-the-art large vocabulary speech recognition system.
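The n-gram approach can be illustrated with a minimal maximum-likelihood bigram model. This is a sketch only: the tiny corpus is invented, and a real system would train on millions of words and apply smoothing so that unseen bigrams do not receive zero probability.

```python
from collections import defaultdict

def train_bigram_lm(sentences):
    """Estimate maximum-likelihood bigram probabilities P(w2 | w1).

    Sketch only: real systems add smoothing (e.g. back-off) so that
    bigrams unseen in training do not receive zero probability.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        # Sentence-boundary markers let the model score the first
        # and last words of a sentence like any other bigram.
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for w1, w2 in zip(tokens, tokens[1:]):
            counts[w1][w2] += 1
    probs = {}
    for w1, followers in counts.items():
        total = sum(followers.values())
        probs[w1] = {w2: c / total for w2, c in followers.items()}
    return probs

# Tiny invented corpus for illustration.
lm = train_bigram_lm(["the news was read", "the news was late"])
print(lm["was"])   # {'read': 0.5, 'late': 0.5}
```

The model captures only local, sequential constraints: the probability of each word depends on its immediate predecessor and nothing else.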
In recent years, however, rule-based NLP systems have demonstrated high performance in several large vocabulary textual tasks, in particular the information extraction tasks evaluated in the series of ARPA Message Understanding Conferences (MUC). For example, recent MUC evaluations of named entity extraction from newswire texts have resulted in precision and recall scores of over 90% (e.g. the Sheffield LaSIE system).
Text retrieval systems, based on the ``bag-of-words'' model of text, have also achieved success in unconstrained large vocabulary tasks. Crudely speaking, the bag-of-words model represents a document as a distribution over unigrams. This model stands in direct contrast to the n-gram model, since it is entirely global with no local, sequential constraint. There have been various efforts to combine the n-gram and bag-of-words approaches, based around topic dependent or mixture language modelling. Other statistical models of the global structure of text include latent semantic analysis and Poisson mixtures. Latent semantic analysis involves projecting the high dimensional, discrete ``word space'' into a much lower dimensional continuous space using singular value decomposition. Recently, this approach has been investigated for statistical language modelling by Bellegarda and ourselves. Latent semantic analysis is particularly interesting since it offers a route to transform ``language modelling into a signal processing problem'' by enabling language models to be constructed from continuous probability distributions.
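The latent semantic analysis step can be sketched as a truncated singular value decomposition of a toy term-document count matrix. The terms and counts below are invented for illustration; a real system would use tens of thousands of terms and typically apply tf-idf weighting before the decomposition.

```python
import numpy as np

# Toy term-document count matrix: rows are terms, columns are
# documents (two finance documents, one sports document).
terms = ["stock", "market", "goal", "match"]
A = np.array([
    [3, 2, 0],   # "stock":  frequent in the finance documents
    [2, 3, 0],   # "market": likewise
    [0, 0, 4],   # "goal":   frequent in the sports document
    [0, 1, 3],   # "match":  mostly sports
], dtype=float)

# Truncated SVD: keep only the k largest singular values, projecting
# the discrete word space into a low-dimensional continuous space.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
term_vectors = U[:, :k] * s[:k]

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Terms that co-occur in similar documents end up close together in
# the reduced space: "stock"/"market" high, "stock"/"goal" low.
print(cosine(term_vectors[0], term_vectors[1]))
print(cosine(term_vectors[0], term_vectors[2]))
```

Because the reduced space is continuous, word similarity becomes a geometric quantity, which is what makes this representation attractive for language modelling with continuous probability distributions.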
Named entity identification in spoken language would be useful as a way of structuring recognizer output, and as part of an information extraction or information retrieval system. Current rule-based systems make use of punctuation, capitalization and other features found in text, and thus perform less well on raw speech recognizer output -- although no real effort has been made to develop such systems for spoken language. Recently HMM-based systems have been developed for named entity identification with recall/precision performance similar to the best rule-based systems; these systems degrade only slightly when retrained on speech recognizer output. This indicates that the combination of simple finite state models and powerful estimation algorithms that has served so well in speech recognition is transferable to more complex speech and language problems involving some degree of understanding. However, such systems are still loosely coupled in the traditional style of speech and language integration; in this project we propose a much tighter coupling, made possible by the finite state nature of the models employed.
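A minimal sketch of the HMM approach to named entity identification: a two-state model (ordinary word vs. part of a name) decoded with the Viterbi algorithm. All probabilities and the word list are invented for illustration; a real system estimates its parameters from annotated data and uses a much richer state space.

```python
import math

# Toy HMM: state "O" = ordinary word, "NE" = part of a named entity.
# All probabilities below are invented for illustration only.
states = ["O", "NE"]
start = {"O": 0.8, "NE": 0.2}
trans = {"O": {"O": 0.8, "NE": 0.2}, "NE": {"O": 0.4, "NE": 0.6}}
emit = {
    "O":  {"president": 0.3, "met": 0.3, "the": 0.3,
           "clinton": 0.05, "yeltsin": 0.05},
    "NE": {"clinton": 0.45, "yeltsin": 0.4,
           "president": 0.05, "met": 0.05, "the": 0.05},
}

def viterbi(words):
    """Most probable state sequence under the finite state model."""
    # delta[s] = best log-probability of any path ending in state s.
    delta = {s: math.log(start[s] * emit[s][words[0]]) for s in states}
    backpointers = []
    for w in words[1:]:
        new_delta, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: delta[p] + math.log(trans[p][s]))
            new_delta[s] = (delta[prev] + math.log(trans[prev][s])
                            + math.log(emit[s][w]))
            pointers[s] = prev
        delta = new_delta
        backpointers.append(pointers)
    # Trace the best path backwards through the stored pointers.
    path = [max(states, key=lambda s: delta[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return list(reversed(path))

print(viterbi(["president", "clinton", "met", "yeltsin"]))
# ['O', 'NE', 'O', 'NE']
```

The same dynamic-programming machinery underlies HMM speech recognizers, which is why a tight coupling between the recognizer and such a tagger is plausible: both are finite state models searched with the same algorithms.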