Software

We provide three software baselines for array synchronization, enhancement, and conventional or end-to-end ASR.

To refer to these baselines in a publication, please cite:

Array synchronization baseline

The start of each device recording has been synchronised by aligning the onset of a synchronisation tone played at the beginning of each recording session. However, the signals can drift progressively out of sync due to a combination of clock drift (on all devices) and occasional frame dropping on the Kinects. This misalignment is approximately compensated for by the device-dependent utterance start and end times provided in the transcripts.
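
To get a feel for the scale of the problem, the short sketch below computes the offset accumulated by two free-running clocks. The 50 ppm rate mismatch and the two-hour session length are illustrative assumptions, not measured properties of the recording devices, and the function name is hypothetical.

```python
def accumulated_offset(drift_ppm: float, elapsed_s: float) -> float:
    """Offset in seconds between two clocks whose rates differ by drift_ppm."""
    return drift_ppm * 1e-6 * elapsed_s


# Hypothetical values: a 50 ppm mismatch over a 2-hour session.
offset = accumulated_offset(50.0, 2 * 3600)
print(f"{offset:.2f} s")  # 0.36 s
```

Even a small rate mismatch therefore adds up to a sizeable fraction of a second over a long session, which is why a single alignment at the start of the recording is not sufficient.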

The baseline device-dependent start and end times have been computed in two steps using a simple cross-correlation based approach.

  1. estimate_alignment.py: Estimation of a 'recording time' to 'signal delay' mapping between a reference binaural recorder and all other devices. The delay between a pair of channels is estimated at regular intervals throughout the party by locating the peak in a cross-correlation between windows of each signal. Estimates are fitted with a linear regression when comparing binaural mics (no frame dropping) and are smoothed with a median filter when comparing the binaural mic to Kinect recordings.
  2. align_transcription.py: The transcript files are augmented with the device-dependent utterance timings. The original binaural recorder transcription times are first mapped onto the reference binaural recorder, and then from the reference recorder onto each of the Kinects.

The array synchronisation baseline is available on GitHub.
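
The two steps can be sketched in a few lines of plain Python. This is a minimal illustration of the technique, not the code of estimate_alignment.py or align_transcription.py: all function names are hypothetical, the cross-correlation is a brute-force loop over lags (a practical implementation would more likely use an FFT-based routine), and the toy signals and delay values are made up.

```python
from statistics import median


def estimate_delay(ref, other, max_lag):
    """Return the lag (in samples) at which `other` best matches `ref`.

    A positive lag means `other` is delayed relative to `ref`.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, r in enumerate(ref):
            m = n + lag
            if 0 <= m < len(other):
                score += r * other[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def fit_line(times, delays):
    """Least-squares fit of delay = slope * time + intercept."""
    n = len(times)
    mean_t = sum(times) / n
    mean_d = sum(delays) / n
    var_t = sum((t - mean_t) ** 2 for t in times)
    cov = sum((t - mean_t) * (d - mean_d) for t, d in zip(times, delays))
    slope = cov / var_t
    return slope, mean_d - slope * mean_t


def median_smooth(delays, width=3):
    """Sliding median filter; windows are truncated at the edges."""
    half = width // 2
    return [median(delays[max(0, i - half): i + half + 1])
            for i in range(len(delays))]


def map_time(t_ref, slope, intercept):
    """Map a reference-recorder time onto the other device's timeline."""
    return t_ref + slope * t_ref + intercept


# Step 1: locate the cross-correlation peak for one window pair.
ref = [0, 0, 1, 3, -2, 4, 0, 0, 0, 0]
other = [0, 0, 0, 0, 1, 3, -2, 4, 0, 0]  # `ref` delayed by 2 samples
print(estimate_delay(ref, other, max_lag=4))  # 2

# Delay estimates (seconds) at regular recording times (seconds).
slope, intercept = fit_line([0, 10, 20, 30], [0.0, 0.1, 0.2, 0.3])

# Step 2: shift a transcript time onto the other device.
print(map_time(100.0, slope, intercept))
```

The linear fit suits pure clock drift, where the delay grows steadily with time. Frame dropping introduces sudden jumps that a straight line cannot follow, which is where a median filter such as `median_smooth` is the more robust choice.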

Enhancement and conventional ASR baseline using Kaldi

The enhancement and ASR baseline is distributed through the Kaldi GitHub repository in kaldi/egs/chime5/s5.

The main script (run.sh) includes:

  1. Data preparation (stage 0 and 1):
    • Prepare Kaldi format data directories, lexicon, and language models
    • Language model: maximum entropy based 3-gram
      data/srilm/best_3gram.gz -> 3gram.me.gz
    • Vocabulary size: 127,712
      $wc -l data/lang/words.txt
      127712 data/lang/words.txt
  2. Enhancement (stage 2):
    • Weighted delay-and-sum beamforming based on the BeamformIt toolkit [1]
  3. Feature extraction and data arrangement (stage 3-6):
    • MFCC feature extraction
    • Training data (250k utterances, in data/train_worn_u100k), which combines left and right channels (150k utterances) of the binaural microphone data (data/train_worn) and a subset (100k utterances) of all array microphone data (data/train_u100k)
    Note that we observed some improvements when using larger amounts of training data than the above subset. However, we limit the size of the data so that all experiments can be completed with feasible computational resources.
  4. GMM baseline (stage 7-16):
    • Training and recognition script using Gaussian mixture models (GMMs). The recognition is performed with enhanced speech data for the development and evaluation sets. The GMM baseline includes the standard triphone based acoustic models with various feature transformations including linear discriminant analysis (LDA), maximum likelihood linear transformation (MLLT), and feature space maximum likelihood linear regression (fMLLR) with speaker adaptive training (SAT).
  5. Data cleanup (stage 17):
    • This stage removes several irregular utterances, which improves the final performance of the system.
  6. LF-MMI TDNN baseline script (stage 18):
    • This is an advanced time-delay neural network (TDNN) baseline using lattice-free maximum mutual information (LF-MMI) training [2]. This baseline requires substantial computational resources: multiple GPUs for DNN training, many CPUs for i-vector and lattice generation, and large storage space for data augmentation (speed perturbation).
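
The enhancement in stage 2 relies on beamforming with BeamformIt [1]. As a rough illustration of the underlying idea only, the sketch below implements a plain delay-and-sum beamformer with known integer sample delays and optional channel weights. It is not BeamformIt's algorithm, which also has to estimate the delays and weights from the signals themselves; the function name and the toy signals are hypothetical.

```python
def delay_and_sum(channels, delays, weights=None):
    """Align each channel by its integer delay (in samples) and sum.

    channels: list of equal-length sample lists, one per microphone.
    delays:   delays[i] is how many samples channel i lags the reference.
    weights:  optional per-channel weights; uniform averaging if omitted.
    """
    n_ch = len(channels)
    length = len(channels[0])
    if weights is None:
        weights = [1.0 / n_ch] * n_ch
    out = [0.0] * length
    for ch, d, w in zip(channels, delays, weights):
        for n in range(length):
            m = n + d  # advance the lagging channel by d samples
            if 0 <= m < length:
                out[n] += w * ch[m]
    return out


# Toy example: the same pulse arrives 0, 1 and 2 samples late.
mics = [
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
]
print(delay_and_sum(mics, delays=[0, 1, 2]))
```

After alignment the pulses add coherently at a single sample, whereas uncorrelated noise on the individual channels would be averaged down.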

[1] X. Anguera, C. Wooters, and J. Hernando, "Acoustic beamforming for speaker diarization of meetings", IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 7, pp. 2011-2023, 2007.
[2] D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur, "Purely sequence-trained neural networks for ASR based on lattice-free MMI", in Proc. Interspeech, pp. 2751-2755, 2016.

You can run the baseline as follows:

  1. Download Kaldi, compile the Kaldi tools, and install BeamformIt for beamforming, Phonetisaurus for constructing a lexicon using grapheme-to-phoneme (G2P) conversion, and SRILM for language model construction. For SRILM, you need to download the source (srilm.tgz) first.
    git clone https://github.com/kaldi-asr/kaldi.git
    cd kaldi/tools
    git checkout 22fbdd96960834e1f3c7481f9a859d6c4cb809e7
    make -j                             # "-j" parallelizes the compilation
    ./extras/install_beamformit.sh      # BeamformIt
    ./extras/install_srilm.sh           # Get source from http://www.speech.sri.com/projects/srilm/download.html first
    ./extras/install_phonetisaurus.sh   # G2P
  2. Compile Kaldi.
    cd ../src
    ./configure
    make depend -j
    make -j
  3. Move to the CHiME-5 ASR baseline in the Kaldi egs/ directory.
    cd ../egs/chime5/s5
  4. Specify model and CHiME-5 root paths in run.sh
    chime5_corpus=<your CHiME-5 path>
  5. Execute run.sh.
    ./run.sh
    We suggest using the following command to save the main log file:
    nohup ./run.sh > run.log
  6. If your experiments have failed or you want to resume your experiments at some stage, you can use the following command (this example is to rerun GMM experiments from stage 7):
    ./run.sh --stage 7
  7. If you have your own enhanced speech data for testing, you can use it in place of the baseline enhancement.
  8. You can find the resulting word error rates (WERs) in the following files:
    enhancement=beamformit
    # GMM
    exp/tri3/decode_dev_${enhancement}_ref
    exp/tri3/decode_eval_${enhancement}_ref
    # LF-MMI TDNN
    exp/chain_train_worn_u100k_cleaned/tdnn1a_sp/decode_dev_${enhancement}_ref
    exp/chain_train_worn_u100k_cleaned/tdnn1a_sp/decode_eval_${enhancement}_ref
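
For orientation, WER is the word-level edit distance between the reference and the hypothesis, divided by the number of reference words. The sketch below is a minimal stand-alone illustration of that definition, not Kaldi's scoring script; the function name and the example sentences are made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One deletion ("me") and one substitution ("please" -> "peas").
print(word_error_rate("pass me the salt please", "pass the salt peas"))  # 0.4
```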

Note:

Enhancement and end-to-end ASR baseline using ESPnet

The enhancement and ASR baseline is distributed through the ESPnet GitHub repository in espnet/egs/chime5/s5.

The main script (run.sh) includes:

  1. Data preparation (stage 0):
    • Same as Kaldi data directory preparation
    • Includes weighted delay-and-sum beamforming based on the BeamformIt toolkit [1]
  2. Feature extraction (stage 1):
    • Log-Mel-filterbank and pitch features
    • Training data (350k utterances, in data/train_worn_u200k), which combines both channels (150k utterances) of the binaural microphone data (data/train_worn) and a subset (200k utterances) of all array microphone data (data/train_u200k)
  3. Dictionary preparation (stage 2):
    • Make a (character-level) dictionary and prepare transcriptions based on the JSON format
  4. Language model training (stage 3):
    • A character-level LSTM language model is constructed, which is integrated with a decoder network during recognition.
  5. End-to-end model training (stage 4):
    • Train an attention/CTC hybrid architecture, which regularizes an attention encoder-decoder network with a CTC objective. Details are described in [3].
  6. Recognition (stage 5):
    • Recognize test data using attention/CTC joint decoding with LSTM language model fusion. Details are described in [3].
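
Training and recognition in stages 4 and 5 share one idea: interpolating CTC and attention quantities. As a sketch in the spirit of [3], training minimises a weighted sum of the two losses, and decoding ranks hypotheses by a weighted sum of their log-probabilities plus a language model term. The function names, weight values, and scores below are illustrative assumptions, not the settings used in the ESPnet recipe.

```python
def hybrid_loss(loss_ctc: float, loss_att: float, ctc_weight: float) -> float:
    """Multi-task training objective: interpolate CTC and attention losses."""
    return ctc_weight * loss_ctc + (1.0 - ctc_weight) * loss_att


def joint_score(logp_att, logp_ctc, logp_lm, ctc_weight, lm_weight):
    """Hypothesis score for joint decoding with language model fusion."""
    return (ctc_weight * logp_ctc
            + (1.0 - ctc_weight) * logp_att
            + lm_weight * logp_lm)


# Hypothetical log-probabilities (attention, CTC, LM) for two hypotheses.
hyps = {
    "hello world": (-1.2, -2.0, -3.0),
    "hollow word": (-1.0, -4.5, -6.0),
}
best = max(
    hyps,
    key=lambda h: joint_score(*hyps[h], ctc_weight=0.3, lm_weight=0.5),
)
print(best)  # hello world
```

In this toy case the attention decoder alone slightly prefers the second hypothesis, but the CTC and language model terms overturn that preference. That corrective effect is the motivation for scoring with all three models jointly.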

[3] S. Watanabe, T. Hori, S. Kim, J. R. Hershey, and T. Hayashi, "Hybrid CTC/attention architecture for end-to-end speech recognition", IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1240-1253, 2017.

You can run the baseline as follows:

  1. Download ESPnet, then compile Kaldi, Chainer, PyTorch, and other tools:
    git clone https://github.com/espnet/espnet.git
    cd espnet/tools
    git checkout e3d1854150a1b4371f9635267e047f9dae950aa2
    make -j                             # "-j" parallelizes the compilation
    cd kaldi/tools
    ./extras/install_beamformit.sh      # BeamformIt
    Or, if you have already installed the correct version of Kaldi with the required tools above, you can simply create a symlink instead of compiling Kaldi. Note that the end-to-end system does not require installing a G2P tool.
    git clone https://github.com/espnet/espnet.git
    cd espnet/tools
    git checkout e3d1854150a1b4371f9635267e047f9dae950aa2
    ln -s <your kaldi> kaldi
    make -j                             # "-j" parallelizes the compilation
  2. Move to the CHiME-5 ASR baseline in the ESPnet egs/ directory.
    cd espnet/egs/chime5/s5
  3. Specify model and CHiME-5 root paths in run.sh
    chime5_corpus=<your CHiME-5 path>
  4. Execute run.sh.
    ./run.sh # CPU mode
    or
    ./run.sh --gpu 0 # GPU mode
    We suggest using the following command to save the main log file:
    nohup ./run.sh > run.log
  5. If your experiments have failed or you want to resume your experiments at some stage, you can use the following command (this example is to rerun training experiments from stage 4):
    ./run.sh --stage 4
  6. If you have your own enhanced speech data for testing, you can use it in place of the baseline enhancement.
  7. You can find the resulting word error rates (WERs) in the following files:
    enhancement=beamformit
    exp/train_worn_u200k_.../decode_dev_${enhancement}_ref.../result.wrd.txt
    exp/train_worn_u200k_.../decode_eval_${enhancement}_ref.../result.wrd.txt

Note: