m4 - multimodal meeting manager

M4 speech recognition

Decoder comparison

HTK has all the tools needed to train a speech recognition system from a flat start. However, it has some limitations. Most notably, release 3.2 does not include a decoder that is efficient enough to perform large vocabulary continuous speech recognition (LVCSR). As a result, only bigram language models (LMs) can be used with HTK's Viterbi decoder.

The DUcoder is a more efficient stack decoder that is better suited to LVCSR. It uses the libraries from an earlier release of HTK (version 2.0), so many of the file types and models created by HTK 3.2 can be used directly by the DUcoder. It can decode with n-gram language models and has many pruning options that enable quicker searches. At present no attempt has been made to port the DUcoder to HTK 3.2. An upgrade would, most likely, require the changes that were made to the HTK 2.0 core libraries to be applied to version 3.2. The difference in versions means that many of the new HTK features are not compatible with the DUcoder. For example, the DUcoder does not automatically recognise PLP feature types. While this incompatibility may be circumvented by changing the feature type label inside each file, other problems are not so easily solved. Table 1 compares HTK's Viterbi decoder (HVite) and the DUcoder.

Feature                                    HVite   DUcoder
PLP features                               yes     no (incompatible file types, easily fixed)
MLLR trained models                        yes     no (incompatible file types)
Trigram LM                                 no      yes
Cross word context dependent decoding      yes     no (not implemented)
Word internal context dependent decoding   yes     yes
Table 1: HVite vs. DUcoder
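
The "easily fixed" PLP entry refers to changing the feature type label inside each file. As a rough illustration only, the sketch below rewrites the parameter kind field in the 12-byte header of an HTK parameter file, keeping the qualifier bits (_E, _D, _A and so on) intact. It assumes the files are big-endian HTK feature files and that the label to change is the one in this header. The text does not say which replacement label the DUcoder accepts, so MFCC is used here purely as a placeholder, and the function name relabel_header is hypothetical.

```python
import struct

# HTK parameter file header: nSamples (int32), sampPeriod (int32),
# sampSize (int16), parmKind (uint16). Big-endian is assumed here.
HEADER = struct.Struct(">iihH")
BASE_MASK = 0o77  # the low six bits of parmKind hold the base kind
PLP = 11          # HTK base kind code for PLP
MFCC = 6          # HTK base kind code for MFCC (placeholder target, an assumption)


def relabel_header(data: bytes, old: int = PLP, new: int = MFCC) -> bytes:
    """Return a copy of the file contents with the base kind swapped.

    Qualifier bits and the feature vectors themselves are left untouched.
    Files whose base kind is not `old` are returned unchanged.
    """
    n_samples, samp_period, samp_size, parm_kind = HEADER.unpack_from(data)
    if parm_kind & BASE_MASK != old:
        return data
    new_kind = (parm_kind & ~BASE_MASK) | new
    header = HEADER.pack(n_samples, samp_period, samp_size, new_kind)
    return header + data[HEADER.size:]
```

Only the label changes; the feature values are still PLP coefficients, so this makes sense only when the acoustic models were trained on the same features.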

Several possibilities are available to achieve trigram cross word context dependent decoding. We are currently considering n-best list rescoring and word lattice rescoring using the SRILM toolkit.
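
To make the n-best option concrete, here is a minimal sketch of the rescoring step. It does not use SRILM. The trigram model is a plain dictionary of log10 probabilities, unseen trigrams receive a fixed floor score instead of proper back-off, and the names trigram_logprob, rescore, lm_scale and word_penalty are all hypothetical. The idea is that a first pass produces several hypotheses with acoustic scores, and each hypothesis is then re-ranked with a trigram LM score.

```python
from typing import Dict, List, Sequence, Tuple

Trigram = Tuple[str, str, str]
Hypothesis = Tuple[float, List[str]]  # (acoustic log score, word sequence)


def trigram_logprob(words: Sequence[str], lm: Dict[Trigram, float],
                    floor: float = -10.0) -> float:
    """Sum trigram log scores over a sentence padded with <s> and </s>.

    Unseen trigrams get `floor`; a real LM would back off to bigrams.
    """
    padded = ["<s>", "<s>"] + list(words) + ["</s>"]
    total = 0.0
    for i in range(2, len(padded)):
        total += lm.get((padded[i - 2], padded[i - 1], padded[i]), floor)
    return total


def rescore(nbest: Sequence[Hypothesis], lm: Dict[Trigram, float],
            lm_scale: float = 10.0,
            word_penalty: float = 0.0) -> List[Hypothesis]:
    """Re-rank hypotheses by acoustic score plus scaled trigram LM score."""
    scored = []
    for acoustic, words in nbest:
        total = (acoustic
                 + lm_scale * trigram_logprob(words, lm)
                 + word_penalty * len(words))
        scored.append((total, list(words)))
    # Best (highest) combined score first.
    return sorted(scored, key=lambda item: item[0], reverse=True)
```

Because the first pass only has to supply the hypotheses and their acoustic scores, the cross word acoustic models and the trigram LM never need to be handled by the same decoder.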