Large Vocabulary Continuous Speech Recognition (LVCSR)
Most speech recognition systems are based on generative models of acoustics: hidden Markov models (HMMs) that typically use Gaussian mixture models to generate sequences of acoustic vectors. This is a powerful approach and, in conjunction with a language model, results in a full probability model for the speech recognition task: the joint distribution of a sequence of words and a sequence of acoustic vectors. While this approach is optimal if the true model lies within the chosen model space, it is not necessarily optimal in an incorrect model space; in particular, it may not make the best use of the available parameters to solve the problem.
We have pursued discriminative acoustic models based on connectionist (neural) networks, typically multi-layer perceptrons (MLPs) and recurrent neural networks (RNNs). The outputs of these networks, when trained as phone classifiers, are estimates of the posterior probability of a phone given the acoustic data. These may be transformed to scaled likelihoods and used as the "output distribution" in an HMM. The resultant model is discriminative, and may be regarded as a stochastic finite state acceptor model. Benefits of this approach include good recognition performance using context-independent models and relatively compact acoustic models.
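The conversion from posteriors to scaled likelihoods can be sketched as follows. By Bayes' rule, dividing the network output P(phone | acoustics) by the phone prior P(phone) gives P(acoustics | phone) / P(acoustics), which can stand in for the HMM output distribution since P(acoustics) is constant across competing hypotheses. The function name and dictionary representation below are illustrative, not taken from the original system:

```python
def scaled_likelihoods(posteriors, priors):
    """Convert network outputs P(phone | acoustics) into scaled
    likelihoods P(acoustics | phone) / P(acoustics) by dividing each
    posterior by the phone prior P(phone), typically estimated from
    the relative frequencies of phones in the training data."""
    return {ph: posteriors[ph] / priors[ph] for ph in posteriors}
```

In practice this is done per frame, and the logarithms of the scaled likelihoods are accumulated along HMM state sequences during decoding.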
Furthermore, the posterior probability estimates output by the networks may be used as confidence measures with a direct link to the statistical model. We have used this for rejection, evaluation of pronunciation models, model combination and pruning in search.
Language Models
In the statistical framework the task of the language model is to predict the next word given the previous word history. It has proven extremely difficult to produce better models (in terms of speech recognition accuracy) than the venerable trigram, which uses the previous two words to predict the next. Maximum likelihood estimation of trigrams is straightforward (count!), but smoothing and discounting must be employed to avoid assigning zero probabilities. We have done a good deal of work in this area, both in building up a local software infrastructure and in the context of information extraction.
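A minimal sketch of trigram estimation illustrates both the counting step and why smoothing is needed. The add-one scheme below is a deliberately crude stand-in for the discounting methods used in real systems (e.g. Katz back-off or Kneser-Ney); all function names here are hypothetical:

```python
from collections import defaultdict

def train_trigram(tokens):
    """Count trigrams and their bigram contexts from a token list."""
    tri = defaultdict(int)
    bi = defaultdict(int)
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        tri[(w1, w2, w3)] += 1
        bi[(w1, w2)] += 1
    return tri, bi

def prob(tri, bi, vocab_size, w1, w2, w3):
    """Add-one smoothed trigram probability P(w3 | w1, w2).
    An unseen trigram gets a small non-zero probability instead of
    the zero that raw maximum likelihood estimation would assign."""
    return (tri[(w1, w2, w3)] + 1) / (bi[(w1, w2)] + vocab_size)
```

Without the "+1" terms, any trigram absent from the training text would make an entire candidate word sequence impossible, however good its acoustic match.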
We have investigated some new approaches to language modelling which may be applied in the n-gram framework. Topic-based mixture language models involved the use of information retrieval approaches (particularly tf.idf, the Okapi weight and latent semantic analysis) to cluster text into automatically inferred topic categories. We have also investigated the use of distributions over word rate as a form of prior for n-gram models. In particular we looked at the Poisson and the infinite mixture of Poissons (the negative binomial, which arises when the Poisson rate is itself Gamma-distributed) as variable word rate distributions. This may be regarded as an elegant generalization of language model caching.
Search
The search problem in LVCSR can be simply stated: find the most probable sequence of words given a sequence of acoustic observations, an acoustic model and a language model. This is a demanding problem since word boundary information is not available in continuous speech and each word in the dictionary may be hypothesized to start at each frame of acoustic data. The problem is further complicated by the vocabulary size (typically 65,000 words) and the structure imposed on the search space by the language model. Direct evaluation of all the possible word sequences is impossible (given the large vocabulary) and an efficient search algorithm will consider only a very small subset of all possible utterance models. Typically, the effective size of the search space is reduced through pruning of unlikely hypotheses and/or the elimination of repeated computations.
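A quick calculation makes the infeasibility of direct evaluation concrete: the number of candidate word sequences grows as V^n in the vocabulary size V and sequence length n. The function below is purely illustrative:

```python
def hypothesis_count(vocab_size, max_words):
    """Number of distinct word sequences of length 1..max_words
    over a vocabulary of the given size: sum of V**n.
    Illustrates why exhaustive search over a 65,000-word
    vocabulary is out of the question."""
    return sum(vocab_size ** n for n in range(1, max_words + 1))
```

Even for two-word utterances over a 65,000-word vocabulary this already exceeds four billion hypotheses, before any acoustic segmentation ambiguity is considered, which is why practical decoders rely on pruning and the sharing of repeated computation.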
We developed a novel, efficient search strategy based on stack decoding, that we referred to as start-synchronous search. This is a single-pass algorithm that is naturally factored into time-asynchronous processing of the word sequence and time-synchronous processing of the HMM state sequence. The search architecture enables the search to be decoupled from the language model while still maintaining the computational benefits of time-synchronous processing.
In a precursor to the acoustic confidence measure research (above) we developed a novel pruning technique, referred to as phone deactivation pruning. This method of pruning - which is complementary to other beam search and language model pruning methods - uses the phone posterior probability estimates output by a connectionist acoustic model directly. A threshold is placed on the phone posteriors, and those phones below the threshold are pruned. This approach enabled a search space reduction of up to 70%, with a relative search error of less than 2%.
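The core of phone deactivation pruning can be sketched in a few lines: at each frame, only phones whose posterior clears the threshold remain active, and hypotheses passing through deactivated phones need not be extended. The function name and data layout are illustrative:

```python
def active_phones(posteriors, threshold):
    """Phone deactivation pruning: given the per-frame phone
    posteriors from the connectionist acoustic model, return the
    set of phones that survive the posterior threshold; all other
    phones are deactivated for this frame."""
    return {ph for ph, p in posteriors.items() if p >= threshold}
```

The threshold trades off search effort against search error: raising it deactivates more phones and shrinks the search space, at a growing risk of pruning away the correct path.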
ABBOT is a speaker-independent continuous speech recognition system, designed for large vocabulary tasks. It was initially developed at Cambridge University Engineering Department as part of the WERNICKE project. At that time it was the joint work of Tony Robinson, Mike Hochberg (now at Nuance) and Steve Renals. After Steve Renals moved to Sheffield, it continued as a joint project between Cambridge and Sheffield, supported by projects such as SPRACH and THISL. Tony Robinson has since left Cambridge to work full-time with his company SoftSound, which is exploiting the parts of ABBOT developed at Cambridge.
During SPRACH we also collaborated heavily with ICSI Berkeley, and we produced a joint system for the 1998 DARPA Broadcast News evaluation that integrated approaches and software from Cambridge, ICSI and Sheffield. At Sheffield we are using this basic system for the recognition of broadcast speech and telephone voicemail.
Current and Future Work
Current work in speech recognition includes: