
SToBS - Structured Transcription of Broadcast Speech

Supported by EPSRC from 15 December 1998 to 14 December 2001.

Investigators: Steve Renals, Rob Gaizauskas and Yoshi Gotoh.

Researcher: Heidi Christensen.

Background

Although large vocabulary continuous speech recognition systems are now available on the high street, it is apparent that beyond the controlled environment of dictation (noise-free, single cooperative speaker, often a narrow task domain) even the best research systems have an unacceptably low level of performance. The best systems on broadcast news have a word error rate of 10-50% depending on the condition; transcription of spontaneous telephone speech has word error rates of 30-60%. Further, following discussions with users and potential users of spoken language technology (for example, the industrial advisory board of the SPRACH project), it is apparent that while some of the dissatisfaction with current research systems is due to the high word error rate, the unstructured nature of the recognizer output (for example, the lack of punctuation and capitalization) also limits its usefulness. In this project we plan to address both of these problems through the development of improved statistical language models and the integration of techniques arising from modern NLP systems.
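
Word error rate, the metric quoted above, is the minimum number of word substitutions, deletions and insertions needed to turn the recognizer output into the reference transcript, divided by the reference length. A minimal sketch in Python, with an invented sentence pair:

    def word_error_rate(reference, hypothesis):
        """Levenshtein (edit) distance over words, normalized by reference length."""
        ref, hyp = reference.split(), hypothesis.split()
        # d[i][j] = edit distance between ref[:i] and hyp[:j]
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                d[i][j] = min(d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]),  # substitution/match
                              d[i - 1][j] + 1,                              # deletion
                              d[i][j - 1] + 1)                              # insertion
        return d[len(ref)][len(hyp)] / len(ref)

    # Invented example: one substitution in six reference words -> WER of about 16.7%
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))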

It is not an exaggeration to say that, since the earliest systems produced by IBM, all successful large vocabulary speech recognition systems have been based on n-gram language models. There have been many efforts to provide enhanced language models, based on both richer statistical models and linguistic models; however, these have been notable for their failure to surpass the performance of crude n-gram models. A clear lesson from the past fifteen years is that a naive application of theoretically well-motivated richer models of language has a very low probability of improving a state-of-the-art large vocabulary speech recognition system.
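
For concreteness, the sketch below estimates a bigram model (the n = 2 case) with add-one smoothing; the two-sentence corpus is invented purely for illustration:

    from collections import Counter

    # Invented toy corpus; <s> and </s> mark sentence boundaries.
    corpus = ["<s> the market rose today </s>",
              "<s> the market fell today </s>"]

    tokens = [w for sent in corpus for w in sent.split()]
    vocab = set(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))

    def p_bigram(prev, word):
        # Add-one (Laplace) smoothed estimate of P(word | prev).
        return (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(vocab))

    print(p_bigram("the", "market"))  # seen in both sentences: relatively high
    print(p_bigram("the", "rose"))    # unseen bigram: small but non-zero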

In recent years, however, rule-based NLP systems have demonstrated high performance on several large vocabulary textual tasks, in particular the information extraction tasks evaluated in the series of ARPA Message Understanding Conferences (MUC). For example, recent MUC evaluations of named entity extraction from newswire texts have resulted in precision and recall scores of over 90% (e.g. the Sheffield LaSIE system).
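
Precision here is the fraction of hypothesized entities that are correct, and recall the fraction of true entities that are found. A simplified exact-match scorer (the real MUC scorer also credits partial matches), with invented entity sets:

    # Invented gold-standard and predicted (type, string) entity sets.
    gold = {("ORG", "IBM"), ("LOC", "Seattle"), ("PER", "Steve Renals")}
    predicted = {("ORG", "IBM"), ("LOC", "Seattle"), ("LOC", "Budapest")}

    tp = len(gold & predicted)        # correctly predicted entities
    precision = tp / len(predicted)   # fraction of predictions that are correct
    recall = tp / len(gold)           # fraction of gold entities recovered
    f1 = 2 * precision * recall / (precision + recall)
    print(precision, recall, f1)      # 0.667 0.667 0.667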

Text retrieval systems, based on the "bag-of-words" model of text, have also achieved success in unconstrained large vocabulary tasks. Crudely speaking, the bag-of-words model represents a document as a distribution over unigrams. It stands in direct contrast to the n-gram model, being entirely global with no local, sequential constraint. There have been various efforts to combine the n-gram and bag-of-words approaches, based around topic-dependent or mixture language modelling. Other statistical models of the global structure of text include latent semantic analysis and Poisson mixtures. Latent semantic analysis involves projecting the high dimensional, discrete "word space" into a much lower dimensional continuous space using singular value decomposition. Recently, this approach has been investigated for statistical language modelling by Bellegarda and by ourselves. Latent semantic analysis is particularly interesting since it offers a route to transform "language modelling into a signal processing problem" by enabling language models to be constructed from continuous probability distributions.
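
The core of latent semantic analysis is a truncated singular value decomposition of a term-document count matrix. A minimal sketch, with an invented four-term, four-document matrix:

    import numpy as np

    # Rows = terms, columns = documents (invented counts).
    X = np.array([[2., 0., 1., 0.],
                  [1., 0., 2., 0.],
                  [0., 3., 0., 1.],
                  [0., 1., 0., 2.]])

    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    k = 2                          # number of latent dimensions kept
    term_vecs = U[:, :k] * s[:k]   # continuous representation of each term

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # Terms with similar document distributions land close together.
    print(cosine(term_vecs[0], term_vecs[1]))  # high: terms 0 and 1 co-occur
    print(cosine(term_vecs[0], term_vecs[2]))  # near zero: disjoint documents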

Named entity identification in spoken language would be useful both as a way of structuring recognizer output and as part of an information extraction or information retrieval system. Current rule-based systems make use of punctuation, capitalization and other features found in text, and thus perform less well on raw speech recognizer output, although little real effort has yet been made to develop such systems for spoken language. Recently, HMM-based systems have been developed for named entity identification with recall/precision performance similar to the best rule-based systems; these systems have shown only a small degradation when retrained on speech recognizer output. This indicates that the combination of simple finite state models and powerful estimation algorithms that has served speech recognition so well is transferable to more complex speech and language problems involving some degree of understanding. However, such systems are still loosely coupled in the traditional style of speech and language integration; in this project we propose a much tighter coupling, made possible by the finite state nature of the models employed.
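
A minimal sketch of the finite state idea: Viterbi decoding of per-word named entity states under an HMM. The states, probabilities and sentence below are all invented for illustration; a real system would estimate the parameters from annotated transcripts.

    import numpy as np

    states = ["O", "PER", "LOC"]             # outside, person, location
    log_init = np.log([0.8, 0.1, 0.1])
    log_trans = np.log([[0.7, 0.15, 0.15],   # row: current state, column: next state
                        [0.5, 0.4, 0.1],
                        [0.5, 0.1, 0.4]])

    def viterbi(log_emit):
        """log_emit[t][s] = log P(word t | state s); returns the best state path."""
        T, S = log_emit.shape
        score = np.zeros((T, S))
        back = np.zeros((T, S), dtype=int)
        score[0] = log_init + log_emit[0]
        for t in range(1, T):
            for s in range(S):
                prev = score[t - 1] + log_trans[:, s]
                back[t, s] = np.argmax(prev)
                score[t, s] = prev[back[t, s]] + log_emit[t, s]
        path = [int(np.argmax(score[-1]))]
        for t in range(T - 1, 0, -1):
            path.append(int(back[t, path[-1]]))
        return [states[s] for s in reversed(path)]

    # Invented emission scores for the words "clinton visited sheffield".
    log_emit = np.log(np.array([[0.05, 0.9, 0.05],
                                [0.9, 0.05, 0.05],
                                [0.1, 0.1, 0.8]]))
    print(viterbi(log_emit))  # ['PER', 'O', 'LOC']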

Objectives

  • The development of an integrated approach to transcribe and identify named entities in broadcast speech, using the acoustic and pronunciation models of the ABBOT system;
  • The establishment of both trainable finite state (HMM) and rule-based named entity identifiers and their comparative quantitative evaluation on both spoken and textual data;
  • The development of very large vocabulary speech recognition systems through named entity language modelling;
  • The investigation and evaluation of new statistical models of language for speech recognition, specifically (1) content language modelling based on latent semantic analysis and (2) Poisson mixture language modelling;
  • The establishment of systems for automatically punctuating recognizer output using rule-based and statistical approaches, and their quantitative comparative evaluation;
  • Participation in international spoken language evaluations including DARPA Hub-4E, TREC/SDR, and MUC evaluations (or their successors).

Publications

  • Y. Gotoh and S. Renals.
    Variable word rate n-grams.
    In Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, Istanbul, 2000.
[abstract/download].
  • M. Stevenson and R. Gaizauskas.
    Using Corpus-derived Name Lists for Named Entity Recognition.
In Proc. Applied Natural Language Processing and North American Chapter of the Association for Computational Linguistics (ANLP-NAACL-2000), Seattle, Washington, 2000.
To appear.
    [postscript].
  • M. Stevenson and R. Gaizauskas.
    Experiments on Sentence Boundary Detection.
In Proc. Applied Natural Language Processing and North American Chapter of the Association for Computational Linguistics (ANLP-NAACL-2000), Seattle, Washington, 2000.
To appear.
    [postscript].
  • Y. Gotoh and S. Renals.
    Information extraction from broadcast news.
    Philosophical Transactions of the Royal Society of London, Series A, 2000.
    To appear.
    [abstract/download].
  • Y. Gotoh and S. Renals.
    Topic-based mixture language modelling.
    J. Natural Language Engineering, 1999.
    In review.
    [abstract/download].
  • S. Renals and Y. Gotoh.
    Integrated transcription and identification of named entities in broadcast speech.
    In Proc. Eurospeech, Budapest, 1999.
    To appear.
    [abstract/download].
  • S. Renals, Y. Gotoh, R. Gaizauskas, and M. Stevenson.
    The SPRACH/LaSIE system for named entity identification in broadcast news.
    In Proc. DARPA Broadcast News Workshop, 1999.
    To appear.
    [abstract/download].
  • Y. Gotoh and S. Renals.
    Statistical annotation of named entities in spoken audio.
    In Proc. ESCA Workshop on Accessing Information In Spoken Audio, pages 43-48, Cambridge, 1999.
    [abstract/download].
  • Y. Gotoh, S. Renals, and G. Williams.
    Named entity tagged language models.
    In Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, pages 513-516, Phoenix AZ, 1999.
    [abstract/download].