m4 - multimodal meeting manager

M4 speech recognition

SWITCHBOARD recognition system

The current Switchboard recogniser is trained on approximately 200 hours of speech, excluding approximately 14 hours that were reserved for testing. The test files are those labelled sw02600 to sw02699. The audio is upsampled to 16 kHz, then PLPs and log energy, together with their first and second derivatives, are computed. The upsampling is done so that the ICSI meetings data and the M4 meetings data can be left at their 16 kHz sample rate.
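The derivative features mentioned above can be illustrated with a minimal sketch. It uses the standard regression formula for delta coefficients; the window of 2 frames, the edge handling by frame repetition, and the toy input values are assumptions for illustration and are not taken from the system's configuration.

```python
def deltas(frames, window=2):
    """Regression-based delta coefficients over +/- `window` frames.

    frames: list of equal-length feature vectors, one per frame.
    Edge frames are handled by repeating the first/last frame (assumption).
    """
    n = len(frames)
    denom = 2 * sum(k * k for k in range(1, window + 1))
    out = []
    for t in range(n):
        vec = []
        for d in range(len(frames[t])):
            acc = 0.0
            for k in range(1, window + 1):
                ahead = frames[min(t + k, n - 1)][d]
                behind = frames[max(t - k, 0)][d]
                acc += k * (ahead - behind)
            vec.append(acc / denom)
        out.append(vec)
    return out


def add_derivatives(static):
    """Append first and second derivatives to each static frame."""
    d1 = deltas(static)
    d2 = deltas(d1)  # second derivative = delta of the deltas
    return [s + a + b for s, a, b in zip(static, d1, d2)]


# Toy input: 5 frames of 2 static values (stand-ins for PLPs and log energy).
static = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0], [4.0, 1.0]]
features = add_derivatives(static)
```

Each output frame is three times the length of the static frame: the static values, then the first derivatives, then the second derivatives.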

The system employs two sets of acoustic models in order to overcome the drawbacks of HVite. In the first stage, the DUcoder uses the word-internal context-dependent triphone models (hmm3044) to generate an n-best list of lattices (where n=200). In the second stage, HVite uses the cross-word context-dependent triphone models (hmm3144) to rescore the lattices. In effect, the DUcoder limits the search space for HVite. No language model is used in the second stage at present, only the lattices and the acoustic models.
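The two-stage idea can be sketched in a few lines. This is a toy illustration only: it works on flat lists of hypotheses rather than lattices, and the function names, scores, and hypotheses are all hypothetical. It does not call the DUcoder or HVite.

```python
def two_pass(all_hyps, first_score, second_score, n=200):
    """Two-pass decoding sketch over a flat hypothesis list.

    first_score / second_score: callables mapping a word sequence to a
    log score under the first-pass and second-pass models respectively.
    """
    # Stage 1: keep only the n best hypotheses under the first-pass models.
    shortlist = sorted(all_hyps, key=first_score, reverse=True)[:n]
    # Stage 2: the second-pass models only ever see the shortlist.
    return max(shortlist, key=second_score)


# Toy scores (log domain, higher is better); all values are made up.
first = {("i", "see"): -10.0, ("eye", "sea"): -12.0, ("icy",): -30.0}
second = {("i", "see"): -9.0, ("eye", "sea"): -8.5, ("icy",): -1.0}

# With n=2 the hypothesis ("icy",) is pruned in stage 1, so stage 2 cannot
# select it even though the second-pass scorer would prefer it.
best = two_pass(list(first), first.get, second.get, n=2)
```

The example shows both sides of the design: the second pass is cheaper because it searches a restricted space, but it can only recover hypotheses that survived the first pass.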

Using a language model scale factor of 15 and a word insertion penalty of -10 for both lattice generation and lattice rescoring, the test word error rate on one of the 13 hours of test data is 45.41%.
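As a rough illustration of how these quantities are usually defined, the sketch below combines scores in the log domain and computes word error rate by edit distance. The score combination shown is the conventional one and is assumed here rather than taken from the system's configuration; the function names and example strings are hypothetical.

```python
def path_score(acoustic_logprob, lm_logprob, n_words,
               lm_scale=15.0, word_penalty=-10.0):
    """Log-domain path score: acoustic + scaled LM + per-word penalty."""
    return acoustic_logprob + lm_scale * lm_logprob + word_penalty * n_words


def word_error_rate(reference, hypothesis):
    """WER (%) = (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return 100.0 * prev[-1] / len(ref)


score = path_score(-100.0, -2.0, 3)
wer = word_error_rate("a b c d", "a x c")
```

A negative word insertion penalty lowers the score of hypotheses with many words, which trades insertion errors against deletion errors.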

SWITCHBOARD 1R2 transcripts

The (29 Jan 2003) ISIP transcripts from Mississippi State University were used. Some minor modifications were made to the dictionaries for HTK and DUcoder compatibility.


Acoustic models

Some HTK models will be published here as they become available. Other models are available on request: these include PLP monophone or triphone models with 1, 2, 4, 8, 16 or 32 Gaussians per state.

Model ID    | Details                                                                                   | Test WER (%)
plp-hmm10   | PLP, 1 Gaussian/state, 46 monophones, ML                                                  | --
plp-hmm3044 | PLP, 16 Gaussians/state, word internal triphone models, 4123 tied triphone states, ML     | --
plp-hmm3144 | PLP, 16 Gaussians/state, cross word triphone models, 4221 tied triphone states, ML        | 45.41 (by rescoring lattices from hmm3044)

Language models

Bigram language models:

Trigram language models: