M4 speech recognition
SWITCHBOARD recognition system
The current SWITCHBOARD recogniser is trained on approximately 200 hours of speech, excluding approximately 14 hours which were reserved for testing. The test files are those labelled sw02600 to sw02699. The audio is upsampled to 16kHz, then PLPs and log energy, together with their first and second derivatives, are computed as features. The upsampling is done so that the ICSI meetings data and M4 meetings data can be left at their 16kHz sample rate.
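As an illustration of the "first and second derivative features" mentioned above, the following is a minimal sketch of the regression-style delta computation described in the HTK book, with the first and last frames replicated at the edges. The function names, the window of 2 frames and the toy two-dimensional input are assumptions made for the example; the page does not state the number of PLP coefficients or the window actually used.

```python
def deltas(frames, window=2):
    """Regression deltas over a list of equal-length feature vectors.

    d_t = sum_{k=1..window} k * (c_{t+k} - c_{t-k}) / (2 * sum_k k^2),
    with the first/last frame repeated beyond the edges.
    The window size is an assumption, not a value from the page.
    """
    n = len(frames)
    denom = 2 * sum(k * k for k in range(1, window + 1))
    out = []
    for t in range(n):
        vec = []
        for d in range(len(frames[t])):
            num = 0.0
            for k in range(1, window + 1):
                ahead = frames[min(t + k, n - 1)][d]   # replicate last frame
                behind = frames[max(t - k, 0)][d]      # replicate first frame
                num += k * (ahead - behind)
            vec.append(num / denom)
        out.append(vec)
    return out


def add_derivatives(static, window=2):
    """Append first and second derivatives to each static vector."""
    first = deltas(static, window)
    second = deltas(first, window)
    return [s + d1 + d2 for s, d1, d2 in zip(static, first, second)]


# Toy static features: one linearly rising value and one constant value.
static = [[float(t), 1.0] for t in range(6)]
features = add_derivatives(static)
```

Each output vector is three times the length of the static vector: the static values (PLPs and log energy in the system above), then their first derivatives, then their second derivatives.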
The system employs two sets of acoustic models in order to overcome the drawbacks of HVite. In the first stage, the DUcoder uses the word-internal context-dependent triphone models (hmm3044) to generate an n-best list of lattices (where n=200). In the second stage, HVite uses the cross-word context-dependent triphone models (hmm3144) to rescore the lattices. In effect, the DUcoder limits the search space for HVite. No language model is used in the second stage at present, only the lattices and acoustic models.
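The idea that the first-pass lattice limits the search space for the second pass can be sketched as a best-path search over a small word lattice. This is not the HVite or DUcoder implementation: the edge format, the `rescore_lattice` function and the toy scores are hypothetical, and the sketch assumes that each lattice arc keeps its first-pass language model score while the acoustic score is replaced by a second-pass value. The default scale factor and insertion penalty are the values quoted for this system.

```python
def rescore_lattice(edges, acoustic_score, lm_scale=15.0, word_penalty=-10.0):
    """Return (best_score, words) for the best path through a lattice.

    edges: list of (start_node, end_node, word, lm_logprob); nodes are
           assumed to be numbered in topological order (start < end),
           with node 0 as the start and the highest node as the end.
    acoustic_score: callable (start, end, word) -> second-pass acoustic
           log-likelihood (hypothetical stand-in for the new models).
    """
    final = max(end for _, end, _, _ in edges)
    best = {0: (0.0, [])}
    # Sorting by start node guarantees every arc into a node has been
    # processed before any arc leaving it.
    for start, end, word, lm_logprob in sorted(edges, key=lambda e: e[0]):
        if start not in best:
            continue  # node not reachable from the start node
        score = (best[start][0]
                 + acoustic_score(start, end, word)
                 + lm_scale * lm_logprob
                 + word_penalty)
        if end not in best or score > best[end][0]:
            best[end] = (score, best[start][1] + [word])
    return best[final]


# Toy lattice with two competing words in each of two slots.
toy_edges = [
    (0, 1, "i", -1.0),
    (0, 1, "eye", -3.0),
    (1, 2, "see", -1.5),
    (1, 2, "sea", -2.0),
]
toy_acoustic = {"i": -50.0, "eye": -48.0, "see": -60.0, "sea": -61.0}

best_score, best_words = rescore_lattice(
    toy_edges, lambda start, end, word: toy_acoustic[word])
```

Only word sequences present in the lattice are considered, which is how the first pass keeps the second-pass search small.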
Using a language model scale factor of 15 and a word insertion penalty of -10 for both lattice generation and lattice rescoring, the test word error rate on one of the 13 hours of test data is 45.41%.
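For reference, word error rate is the number of word substitutions, deletions and insertions needed to turn the recogniser output into the reference transcript, divided by the number of reference words. The sketch below computes it with a standard edit-distance recurrence. The function name and example sentences are illustrative, and the page does not say which scoring tool or text normalisation produced the figure above.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate in percent, via Levenshtein distance over words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors in 6 words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```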
SWITCHBOARD 1R2 transcripts
The ISIP transcripts (dated 29 Jan 2003) from Mississippi State University were used. Some minor modifications were made to the dictionaries for HTK and DUcoder compatibility.
Some HTK models will be published here as they become available. Other models are available on request: these include PLP monophone or triphone models with 1, 2, 4, 8, 16 or 32 Gaussians per state.
|Model ID||Details||Test WER (%)|
|plp-hmm10||PLP, 1 Gaussian/state, 46 monophones, ML||--|
|plp-hmm3044||PLP, 16 Gaussians/state, word internal triphone models, 4123 tied triphone states, ML||--|
|plp-hmm3144||PLP, 16 Gaussians/state, cross word triphone models, 4221 tied triphone states, ML||45.41 (by rescoring lattices from hmm3044)|
Language models
Bigram language models:
- Small LM trained on the SWITCHBOARD transcripts of sw02001 to sw02499
- Small LM trained on the SWITCHBOARD transcripts of sw02600 to sw02699
- LM trained on the whole of SWITCHBOARD minus a 15 hour test set (excludes sw02600 to sw02699)
- LM trained on the whole of SWITCHBOARD
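To make "bigram language model" concrete, here is a minimal sketch that estimates unsmoothed maximum-likelihood bigram probabilities from a list of transcript lines. The `train_bigram` function, the `<s>` and `</s>` boundary markers and the toy sentences are assumptions for the example. The page does not say which toolkit or smoothing method was used to build the models listed above, and a practical model would add discounting and back-off for unseen word pairs.

```python
from collections import Counter


def train_bigram(sentences):
    """Maximum-likelihood bigram estimates P(next | previous).

    sentences: iterable of whitespace-tokenised transcript lines.
    Returns a dict mapping (previous, next) -> probability.
    No smoothing is applied, so unseen pairs are simply absent.
    """
    history_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        history_counts.update(words[:-1])            # every word that has a successor
        pair_counts.update(zip(words[:-1], words[1:]))
    return {pair: count / history_counts[pair[0]]
            for pair, count in pair_counts.items()}


# Toy training data standing in for transcript lines.
toy_model = train_bigram(["yeah i know", "yeah i think so"])
```

In this toy model, `toy_model[("i", "know")]` and `toy_model[("i", "think")]` each come out as 0.5, because "i" is followed by each word once.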