Data

License and citation

The original WSJ0 data are copyrighted by the LDC. Users are expected to comply with the license agreement they signed with the LDC. This requirement also applies to simulated training data derived from the original WSJ0 data.

Real data and booth data are provided under the Creative Commons Attribution-NonCommercial-ShareAlike license, version 2.0. Simulated data derived from real data and booth data are provided under the same license.

If you are working for a company, these licenses allow you to participate in the challenge. If you wish to use the data to derive a commercial product, please contact us.

To refer to these data in a publication, please cite:

The data descriptions are based on the following items:

Overview

The 3rd CHiME challenge targets distant-talking automatic speech recognition using a read speech corpus. We use a setup similar to the 2013 CHiME Challenge Track 2, based on the speaker-independent medium (5k) vocabulary subset of the Wall Street Journal (WSJ0) corpus, and we also provide baseline software for data simulation, speech enhancement, and ASR. The ASR baseline uses the Kaldi ASR toolkit. Two types of data are employed: "real data", i.e., speech recorded in real noisy environments (on a bus, in a cafe, in a pedestrian area, and at a street junction) uttered by actual talkers, and "simulated data", i.e., noisy utterances generated by artificially mixing clean speech data with noisy backgrounds. The ultimate goal is to recognise the real data. The main audio data are provided as 16-bit WAV files sampled at 16 kHz.

Training set: 1600 (real) + 7138 (simulated) = 8738 noisy utterances, from 4 speakers in the real data and from the 83 speakers of the WSJ0 SI-84 training set in the simulated data, in the 4 noisy environments. The transcriptions are also based on those of the WSJ0 SI-84 training set, but the real speech utterances do not contain verbal punctuation (e.g., "period" and "hyphen" in the original WSJ0 SI-84). Reading errors in these transcriptions have been corrected.

Development set: 410 (real) × 4 (environments) + 410 (simulated) × 4 (environments) = 3280 utterances, from 4 speakers other than those in the training data. The utterances are based on the "no verbal punctuation" (NVP) part of the WSJ0 speaker-independent 5k vocabulary development set.

Test set: 330 (real) × 4 (environments) + 330 (simulated) × 4 (environments) = 2640 utterances from 4 further speakers. As for the development set, the utterances are based on the "no verbal punctuation" (NVP) part of the WSJ0 speaker-independent 5k vocabulary evaluation set. The test set follows the same specifications as the development set and will be available from 1st June.

The rest of this document describes the detailed directory structure and naming conventions of the data packaged for the 3rd CHiME challenge.

Data descriptions in detail

The 3rd CHiME challenge provides the audio data, annotations, transcriptions, and a subset of the original WSJ0 database, organised in the following directory structure:

CHiME3/data/
├── annotations
├── audio
├── transcriptions
└── WSJ0

Audio

All audio data (real, simulated, and enhanced) are distributed with a sampling rate of 16 kHz. The 3rd CHiME challenge baseline system, including data simulation, speech enhancement, and ASR, uses only these 16 kHz audio data.

The audio data consist of background noises (backgrounds), speech enhanced by the baseline speech enhancement technique (enhanced), unsegmented noisy speech data (embedded), and segmented noisy speech data (isolated), organised as follows:

CHiME3/data/audio/16kHz/
├── backgrounds
├── enhanced
├── embedded
└── isolated

Isolated: segmented noisy speech data

The segmented noisy speech data are composed of real (REAL), simulated (SIMU), and clean (ORG) speech data. The table below summarizes the subdirectories in CHiME3/data/audio/16kHz/isolated/. The subdirectory names encode 1) the set (tr05 for training, dt05 for development, et05 for evaluation), 2) the recording location (bth, bus, caf, ped, and str), and 3) real vs. simulated data.


| Subdirectory  | Real/Simu | Location | Channels | # speakers | # utterances | Hours | # WAV files |
|---------------|-----------|----------|----------|------------|--------------|-------|-------------|
| dt05_bth      |           | BTH      | 0-6      | 4          | 410          | 0.72  | 2870        |
| dt05_bus_real | REAL      | BUS      | 0-6      | 4          | 410          | 0.68  | 2870        |
| dt05_bus_simu | SIMU      | BUS      | 1-6      | 4          | 410          | 0.72  | 2460        |
| dt05_caf_real | REAL      | CAF      | 0-6      | 4          | 410          | 0.69  | 2870        |
| dt05_caf_simu | SIMU      | CAF      | 1-6      | 4          | 410          | 0.72  | 2460        |
| dt05_ped_real | REAL      | PED      | 0-6      | 4          | 410          | 0.67  | 2870        |
| dt05_ped_simu | SIMU      | PED      | 1-6      | 4          | 410          | 0.72  | 2460        |
| dt05_str_real | REAL      | STR      | 0-6      | 4          | 410          | 0.70  | 2870        |
| dt05_str_simu | SIMU      | STR      | 1-6      | 4          | 410          | 0.72  | 2460        |
| tr05_bth      |           | BTH      | 0-6      | 4          | 399          | 0.75  | 2793        |
| tr05_bus_real | REAL      | BUS      | 0-6      | 4          | 400          | 0.69  | 2800        |
| tr05_bus_simu | SIMU      | BUS      | 1-6      | 83         | 1728         | 3.71  | 10368       |
| tr05_caf_real | REAL      | CAF      | 0-6      | 4          | 400          | 0.76  | 2800        |
| tr05_caf_simu | SIMU      | CAF      | 1-6      | 83         | 1794         | 3.77  | 10764       |
| tr05_org      |           |          | single   | 83         | 7138         | 15.15 | 7138        |
| tr05_ped_real | REAL      | PED      | 0-6      | 4          | 400          | 0.72  | 2800        |
| tr05_ped_simu | SIMU      | PED      | 1-6      | 83         | 1765         | 3.75  | 10590       |
| tr05_str_real | REAL      | STR      | 0-6      | 4          | 400          | 0.73  | 2800        |
| tr05_str_simu | SIMU      | STR      | 1-6      | 83         | 1851         | 3.92  | 11106       |

Each subdirectory contains a set of WAV files. The name of each WAV file represents the speaker, transcription, location, and channel index, as follows:

[Figure: CHiME3 naming convention of isolated noisy speech WAV files (isolated)]

Note that channel indices 1 to 6 (*.CH[1-6].wav) denote the tablet microphones (see the microphone positions on the tablet), and channel index 0 (*.CH0.wav) denotes the close-talk microphone. The simulated data do not contain WAV files for the close-talk microphone.
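As a minimal sketch, the naming convention can also be decoded programmatically. The exact pattern is given in the figure above; the regular expression below assumes the components appear as `<speaker>_<utterance>_<environment>.CH<channel>.wav`, consistent with the speaker/environment/channel fields described in this section, so treat it as an illustration rather than the normative definition.

```python
import re

# Assumed file-name pattern for isolated noisy speech WAV files:
#   <speaker>_<utterance>_<environment>.CH<channel>.wav
# Speakers are either real (e.g. "M03", "F01") or WSJ0 IDs (e.g. "011").
ISOLATED_RE = re.compile(
    r"^(?P<speaker>[MF]\d{2}|\d{3})_"
    r"(?P<utterance>[A-Z0-9]+)_"
    r"(?P<environment>BTH|BUS|CAF|PED|STR)"
    r"\.CH(?P<channel>[0-6])\.wav$"
)

def parse_isolated_name(filename):
    """Split an isolated noisy WAV file name into its components."""
    m = ISOLATED_RE.match(filename)
    if m is None:
        raise ValueError("not an isolated noisy WAV name: %s" % filename)
    info = m.groupdict()
    info["channel"] = int(info["channel"])  # 0 = close-talk microphone
    return info
```

For example, `parse_isolated_name("M03_050C010A_BUS.CH1.wav")` yields speaker `M03`, environment `BUS`, and tablet channel 1.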

Following this naming convention, the WAV files converted from the original WSJ0 data are renamed as follows:

[Figure: CHiME3 naming convention of isolated clean speech WAV files (isolated)]

Note that the channel indexes of isolated clean speech WAV files are omitted.

Enhanced: enhanced speech data

The enhanced speech data are obtained with the baseline speech enhancement method. The subdirectory structure is almost the same as for the isolated data, except that it does not include the booth and original clean speech data (dt05_bth, tr05_bth, and tr05_org). The naming convention is as follows:

[Figure: CHiME3 naming convention of enhanced speech WAV files (enhanced)]


| Subdirectory  | Real/Simu | Location | Channels | # speakers | # utterances | Hours | # WAV files |
|---------------|-----------|----------|----------|------------|--------------|-------|-------------|
| dt05_bus_real | REAL      | BUS      | single   | 4          | 410          | 0.68  | 410         |
| dt05_bus_simu | SIMU      | BUS      | single   | 4          | 410          | 0.72  | 410         |
| dt05_caf_real | REAL      | CAF      | single   | 4          | 410          | 0.69  | 410         |
| dt05_caf_simu | SIMU      | CAF      | single   | 4          | 410          | 0.72  | 410         |
| dt05_ped_real | REAL      | PED      | single   | 4          | 410          | 0.67  | 410         |
| dt05_ped_simu | SIMU      | PED      | single   | 4          | 410          | 0.72  | 410         |
| dt05_str_real | REAL      | STR      | single   | 4          | 410          | 0.70  | 410         |
| dt05_str_simu | SIMU      | STR      | single   | 4          | 410          | 0.72  | 410         |
| tr05_bus_real | REAL      | BUS      | single   | 4          | 400          | 0.69  | 400         |
| tr05_bus_simu | SIMU      | BUS      | single   | 83         | 1728         | 3.71  | 1728        |
| tr05_caf_real | REAL      | CAF      | single   | 4          | 400          | 0.76  | 400         |
| tr05_caf_simu | SIMU      | CAF      | single   | 83         | 1794         | 3.77  | 1794        |
| tr05_ped_real | REAL      | PED      | single   | 4          | 400          | 0.72  | 400         |
| tr05_ped_simu | SIMU      | PED      | single   | 83         | 1765         | 3.75  | 1765        |
| tr05_str_real | REAL      | STR      | single   | 4          | 400          | 0.73  | 400         |
| tr05_str_simu | SIMU      | STR      | single   | 83         | 1851         | 3.92  | 1851        |

Note that the directory assumes the enhanced speech data are single-channel WAV files, so channel information is not included in the file names. If you use your own speech enhancement technique and evaluate it with the provided ASR tool, you should retain exactly the same directory structure and audio file names as this enhanced directory, changing only the directory name.
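To follow that rule with your own enhancement output, it suffices to strip the channel suffix from the isolated file names when writing single-channel results. A minimal sketch (the helper name is ours):

```python
import re

def enhanced_name(isolated_name):
    """Map an isolated multichannel WAV name to the single-channel
    name used in the enhanced directory by dropping the channel
    suffix, e.g. 'M03_050C010A_BUS.CH5.wav' -> 'M03_050C010A_BUS.wav'.
    """
    return re.sub(r"\.CH[0-6]\.wav$", ".wav", isolated_name)
```

All six channel variants of an utterance therefore collapse to one output file name, matching the "# WAV files" column above.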

Backgrounds: background noises

Background noises were also recorded using the same tablet device at the same noisy locations (BUS, CAF, PED, and STR). These noises are used to create simulated data matched to the real noisy speech data. Since they were recorded without speech, they do not include close-talk microphone signals (*.CH0.wav). All background noises are stored in CHiME3/data/audio/16kHz/backgrounds without subdirectories.

The naming convention is as follows:

[Figure: CHiME3 naming convention of background noise WAV files (backgrounds)]


| Subdirectory | Real/Simu | Location         | Channels | # sessions | Hours | # WAV files |
|--------------|-----------|------------------|----------|------------|-------|-------------|
| backgrounds  | REAL      | BUS/CAF/PED/STR  | 1-6      | 17         | 8.42  | 102         |


Embedded: unsegmented noisy speech data

The unsegmented noisy speech data (embedded) are the original recordings; the segmented noisy speech data (isolated) were obtained by cutting these embedded data into separate utterances. The segmentation information can be found in the JSON files in CHiME3/data/annotations/, which are explained in the annotation section below. All embedded data are stored in CHiME3/data/audio/16kHz/embedded without subdirectories.
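As a sketch of how an utterance can be re-cut from an embedded recording, assuming the start/end times in the JSON annotations are in seconds (the function names and paths below are ours, for illustration only):

```python
import wave

FS = 16000  # all distributed CHiME3 audio is sampled at 16 kHz

def segment_bounds(start, end, fs=FS):
    """Convert annotation 'start'/'end' times (seconds) into frame
    indices within the embedded recording."""
    return int(round(start * fs)), int(round(end * fs))

def extract_utterance(embedded_wav, start, end, out_wav):
    """Cut one utterance out of an unsegmented (embedded) WAV file.
    Hypothetical helper: start/end come from the JSON annotations."""
    first, last = segment_bounds(start, end)
    with wave.open(embedded_wav, "rb") as src:
        params = src.getparams()
        src.setpos(first)
        frames = src.readframes(last - first)
    with wave.open(out_wav, "wb") as dst:
        dst.setparams(params)  # header frame count is patched on close
        dst.writeframes(frames)
```

For instance, the annotated interval 32.5301875 s to 35.5184375 s corresponds to frames 520483 to 568295 at 16 kHz.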

The naming convention is as follows:

[Figure: CHiME3 naming convention of unsegmented noisy speech WAV files (embedded)]


| Subdirectory | Real/Simu | Location         | Channels | # speakers | # sessions | Hours | # WAV files |
|--------------|-----------|------------------|----------|------------|------------|-------|-------------|
| embedded     | REAL      | BUS/CAF/PED/STR  | 0-6      | 8          | 51         | 13.98 | 357         |


Annotations

Annotation files in the CHiME3 data use the JSON (JavaScript Object Notation) format (see http://json.org/ for more detail). We prepared 7 JSON annotation files that contain all the information needed for the data simulation, speech enhancement, and ASR experiments.

Note that when a speaker repeated a sentence several times, only one instance was retained and annotated; the other instances were not removed from the embedded recordings.

CHiME3/data/annotations/dt05_real.json

The JSON files contain the various annotations for every utterance. Real utterances have the following 8 basic fields:

    {
        "dot": "Chrysler reduced some prices on Friday",
        "end": 35.51843750000000,
        "environment": "BUS",
        "prompt": "Chrysler reduced some prices on Friday.",
        "speaker": "M03",
        "start": 32.53018750000000,
        "wavfile": "M03_141106_040_BUS",
        "wsj_name": "050C010A"
    },
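These files can be consumed with any JSON library. A minimal sketch using the record above (trailing zeros trimmed; in practice you would read the whole file with json.load()):

```python
import json

# One record from dt05_real.json, inlined here for illustration.
record = json.loads("""
{
    "dot": "Chrysler reduced some prices on Friday",
    "end": 35.5184375,
    "environment": "BUS",
    "prompt": "Chrysler reduced some prices on Friday.",
    "speaker": "M03",
    "start": 32.5301875,
    "wavfile": "M03_141106_040_BUS",
    "wsj_name": "050C010A"
}
""")

# The segmentation of the embedded recordings follows directly from
# the 'start' and 'end' fields (both in seconds).
duration = record["end"] - record["start"]
```

Here the utterance is 2.98825 seconds long and was recorded in the BUS environment.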

CHiME3/data/annotations/dt05_simu.json

In addition to the above basic fields, the JSON file for the simulated development set has some additional fields:

    {
        "dot": "Chrysler reduced some prices on Friday",
        "end": 45.08006250000000,
        "environment": "BUS",
        "noise_end": 35.48262500000000,
        "noise_start": 32.56600000000000,
        "noise_wavfile": "M03_141106_040_BUS",
        "prompt": "Chrysler reduced some prices on Friday.",
        "speaker": "M03",
        "start": 42.16343750000000,
        "wavfile": "M03_141106_010_BTH",
        "wsj_name": "050C010A"
    },

CHiME3/data/annotations/tr05_simu.json

Similarly, the JSON file for the simulated training set has some additional fields:

    {
        "dot": "I always wanted to work on the inside in",
        "environment": "PED",
        "ir_end": 424.3202500000000,
        "ir_start": 420.8561875000000,
        "ir_wavfile": "F02_141106_050_PED",
        "noise_end": 1150.329625000000,
        "noise_start": 1146.872312500000,
        "noise_wavfile": "BGD_150203_010_PED",
        "prompt": "I always wanted to work on the inside in.\"",
        "speaker": "011",
        "wsj_name": "011C0207"
    },


Transcriptions

There are two transcription formats: DOT and TRN.

The transcription directory has a subdirectory structure similar to that of the segmented noisy speech data (isolated, CHiME3/data/audio/16kHz/isolated). The naming convention also follows that of the segmented noisy speech data, except that the transcription files do not carry channel information, i.e.,

[Figure: CHiME3 naming convention of transcription files (transcriptions)]

In CHiME3/data/transcriptions, there are also *.dot_all and *.trn_all files that contain a set of DOT and TRN transcriptions, where each line corresponds to a DOT/TRN transcription. The dot_all/trn_all files and the dot/trn files in the subdirectories carry the same information.
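Assuming the standard WSJ DOT/TRN convention, where each line holds the transcription followed by the utterance ID in parentheses, a dot_all/trn_all file can be split line by line as follows (a sketch; the helper name is ours):

```python
import re

def parse_dot_line(line):
    """Split one DOT/TRN line into (transcription, utterance id),
    assuming the usual WSJ layout 'TEXT (UTTID)'."""
    m = re.match(r"^(.*)\((\S+)\)\s*$", line.strip())
    if m is None:
        raise ValueError("unparsable transcription line: %r" % line)
    return m.group(1).strip(), m.group(2)
```

For example, a line such as `chrysler reduced some prices on friday (050c010a)` splits into the text and the utterance ID `050c010a`.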

WSJ0

This directory is a subset of the original WSJ0 corpus (either LDC93S6A or LDC93S6B) that is used to build the ASR baseline. It contains language models, transcriptions, and SPHERE-format audio data (*.WV1 in the si_dt_05, si_et_05, and si_tr_s directories). Part of these data duplicates the data in CHiME3/data/audio/16kHz/isolated/tr05_org, but the audio data in this directory are stored in NIST SPHERE format. In the ASR baseline, they are converted on the fly using sph2pipe, which is included in Kaldi.
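For reference, sph2pipe converts SPHERE files to RIFF WAV with the -f wav flag. A minimal sketch of invoking it from Python (the file paths are placeholders):

```python
import subprocess

def sph2pipe_cmd(wv1_path, wav_path, sph2pipe="sph2pipe"):
    """Build the sph2pipe command line that converts a SPHERE (*.WV1)
    file to a RIFF WAV file, as the ASR baseline does on the fly."""
    return [sph2pipe, "-f", "wav", wv1_path, wav_path]

# To actually run the conversion (requires sph2pipe on the PATH):
#   subprocess.run(sph2pipe_cmd("050C010A.WV1", "050C010A.wav"), check=True)
```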

All data is available at the download center.