Missing Data ASR



Introduction

Listeners are capable of perceiving speech in the face of quite severe distortions (see the distorted speech demo). Common aspects of many distortions include the absence of spectro-temporal regions, or the presence of additive noise. Missing data occurs naturally in conditions such as telephone speech and channel fade-outs. Missing data speech recognition is a recent approach (Cooke et al, 1994) to robust ASR based on solutions to the following two subproblems:

  1. deciding which spectro-temporal regions of the representation are reliable and which are missing or masked by noise;
  2. recognising speech on the basis of the reliable regions alone.

Solutions to the first subproblem range from existing noise-reduction techniques such as spectral subtraction and local SNR estimation (Hirsch & Ehrlicher, 1995) through to more general-purpose approaches such as computational auditory scene analysis (Rosenthal & Okuno, 1998).

One approach to the second problem is demonstrated here. The purpose of the demo is to allow the user to explore the effect on recognition of removing spectro-temporal regions or adding noise to speech.
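One common solution to the second subproblem is marginalisation: each HMM state's output probability is computed from the reliable spectro-temporal components only, integrating out the missing ones. The demo itself is written in MATLAB; the sketch below is a Python/NumPy illustration for a diagonal-covariance Gaussian state (function names are my own, not the demo's), where marginalising a diagonal Gaussian simply means scoring the present dimensions.

```python
import numpy as np

def full_log_likelihood(x, mean, var):
    """Log-likelihood of observation x under a diagonal-covariance Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def marginal_log_likelihood(x, mean, var, present):
    """Missing-data variant: marginalise out absent dimensions by scoring
    only the components flagged in the boolean mask `present`.  For a
    diagonal Gaussian the marginal is just the product over present dims."""
    p = np.asarray(present, dtype=bool)
    return -0.5 * np.sum(np.log(2 * np.pi * var[p])
                         + (x[p] - mean[p]) ** 2 / var[p])
```

With an all-present mask the marginal score coincides with the full likelihood; as cells are deleted, the state is judged on progressively less evidence.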

It is worth noting that the underlying HMM system is very simple (single component mixture) and makes mistakes even without missing data.

The demonstration

Type 'md' at the MATLAB prompt. Use the speech submenu of the load menu (1) to select a file to load. The speech files supplied with the demo are strings of digits (length 4 or 5) from the TIDigits corpus (ref). The identity of the digit sequence is apparent from its name.

[Aside: File names are of the form ti9o82, which signifies that this is the string "nine-oh-eight-two" from the TIDigits corpus. Recognition models exist for the 9 digits 1-9, together with 'o' for 'oh', 'z' for 'zero' and '-' indicating silence. For this demonstration, recognition is performed using rather simple single mixture HMMs with no use of derivatives, so performance is not perfect even without deletions!]

On loading, the system displays an auditorily-motivated spectrogram which is used as the sole basis for recognition (2). After a short while (longer on slow systems, but a few seconds on a PowerMac running at 200 MHz), the recognised result will appear in the lower panel (3). Displayed boundaries are those obtained by tracing back through the best sequence of models in the usual way.
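The recogniser's sole input is this channels-by-frames energy representation. The demo uses an auditorily-motivated filterbank; purely for illustration, the Python/NumPy sketch below substitutes a crude pooling of STFT magnitude bins into equal-width channels (all parameter names are hypothetical, and this is not the demo's actual front end).

```python
import numpy as np

def filterbank_spectrogram(signal, n_fft=256, hop=128, n_chan=20):
    """Crude stand-in for the demo's auditory spectrogram: an STFT
    magnitude spectrum pooled into n_chan frequency channels.
    Returns an array of shape (channels, time frames)."""
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        spec = np.abs(np.fft.rfft(signal[start:start + n_fft] * window))
        # pool FFT bins into n_chan equal-width channels
        edges = np.linspace(0, len(spec), n_chan + 1).astype(int)
        frames.append([spec[a:b].sum() for a, b in zip(edges[:-1], edges[1:])])
    return np.log(np.asarray(frames).T + 1e-10)
```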

EDITING THE DISPLAY

At this point, clicking anywhere on the spectrographic image invokes an editing function determined by the popup menu (4). For instance, if the mode is erase, an area under the cursor is removed from the display each time the user clicks on the spectrogram. Note that the demo acts like a spreadsheet: any change restarts the recogniser, which updates the recognised result 'almost instantaneously'. If you are using a slow machine (defined here as one which takes more than a few seconds to perform the recognition), you may wish to untick the recalculate checkbox (5) until you want recognition to take place.

The extent of the region affected by clicking in the display is governed by the tool menu (6). Here, you can select the region affected by an edit operation by specifying the frequency region (in channels) and the duration (in 10 ms time frames).
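In matrix terms, an erase operation zeroes a rectangle of the channels-by-frames spectrogram centred on the clicked cell, with the rectangle's size taken from the tool settings. A minimal Python/NumPy sketch (function and parameter names are hypothetical, not the demo's):

```python
import numpy as np

def erase_region(spec, chan, frame, n_chan=4, n_frames=5, fill=0.0):
    """Emulate the erase tool: overwrite a rectangle of the spectrogram
    centred on cell (chan, frame).  n_chan and n_frames mirror the tool
    menu's frequency-channel and 10 ms time-frame settings."""
    out = spec.copy()
    c0 = max(0, chan - n_chan // 2)
    f0 = max(0, frame - n_frames // 2)
    out[c0:c0 + n_chan, f0:f0 + n_frames] = fill
    return out

def percent_missing(edited, original):
    """Status-line figure: percentage of time-frequency cells altered."""
    return 100.0 * np.mean(edited != original)
```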

Other edit options are add_noise, which adds noise at the level specified (relative to the signal mean) by popup menu (10), and restore, which restores the original values. In addition, the whole display can be reset at any time by pressing the restore button (8).

Removing or restoring data has the obvious effect on the display. A status line at the top right of the display indicates the percentage of missing time-frequency regions. The visual effect of adding noise depends on the noise level relative to the speech background. The global SNR is changed by any noise addition, and is indicated on the display. In addition, the user can set a required global SNR and the noise background will be modified to meet this criterion (11).
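The global SNR control described above amounts to rescaling the noise so the speech-to-noise power ratio meets a target value in dB. A Python/NumPy sketch of that calculation (names hypothetical; the demo's MATLAB code may differ):

```python
import numpy as np

def global_snr_db(speech, noise):
    """Global SNR of a speech/noise mixture, in dB."""
    return 10 * np.log10(np.mean(speech ** 2) / np.mean(noise ** 2))

def scale_noise_to_snr(speech, noise, target_snr_db):
    """Scale the noise signal so that mixing it with the speech yields
    the requested global SNR, as with the demo's SNR control (11)."""
    ps = np.mean(speech ** 2)
    pn = np.mean(noise ** 2)
    gain = np.sqrt(ps / (pn * 10 ** (target_snr_db / 10)))
    return noise * gain
```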

RECOGNITION

The manner in which the missing and masked data is recognised is controlled by the options available under the mask (7) and recognition (9) menus. The default recogniser is a conventional strategy which treats the data as being complete. That is, regions deleted by the user are treated as having value zero (energy). Similarly, added noise is part of the data fed to the recogniser. Setting the missing data option via the recognition menu instructs the recogniser to treat certain regions as missing. In the simplest case, these regions are just those which the user has deleted from the display. However, for the case of added noise, the user can force the recogniser to treat those regions with a negative local SNR as missing by selecting the snr > 0 option via the mask menu. The display is immediately updated to reveal those regions which meet this criterion.
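The snr > 0 mask can be sketched directly: a time-frequency cell is treated as reliable when its local speech energy exceeds the local noise energy (local SNR above 0 dB), and as missing otherwise. A Python/NumPy illustration, assuming per-cell speech and noise energies are available (in the demo they are, since the noise is added artificially):

```python
import numpy as np

def local_snr_mask(speech_spec, noise_spec, threshold_db=0.0, floor=1e-10):
    """Build the 'snr > 0' missing-data mask: True marks a reliable cell
    whose local SNR exceeds threshold_db; False cells are treated as
    missing by the recogniser."""
    snr_db = 10 * np.log10((speech_spec + floor) / (noise_spec + floor))
    return snr_db > threshold_db
```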

A further recognition option is to use just spectral peaks (or the subset of them not missing or masked by noise) in the recognition process. This option is obtained via the peaks options under the masks menu. Again, the display is updated to show the points upon which recognition takes place.
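One simple reading of the peaks option is to keep only cells that are local maxima along the frequency axis within each frame; the demo's actual peak picking may be more elaborate, so treat this Python/NumPy sketch as illustrative only.

```python
import numpy as np

def spectral_peak_mask(spec):
    """Mark cells that are strict local maxima along frequency in each
    time frame; in peaks mode only these cells feed the recogniser."""
    mask = np.zeros_like(spec, dtype=bool)
    for t in range(spec.shape[1]):
        col = spec[:, t]
        for c in range(1, len(col) - 1):
            if col[c] > col[c - 1] and col[c] > col[c + 1]:
                mask[c, t] = True
    return mask
```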

INSPECTING THE HMMS AND OUTPUT PROBABILITIES

Certain aspects of the recognition models and process, including the HMMs and their output probabilities, can be visualised via menu (12).

PRESET DELETIONS AND NOISE

It is possible to apply preset deletion patterns without using the mouse to select regions; these are available under the load>deletions submenu.

In addition, noise patterns (at present, these are just other speech files) can be loaded from the load>noise submenu.

In the near future, the demo will be enhanced to include auditory induction effects and perhaps a range of automatic local SNR estimation techniques.

Things to investigate

  1. Load a speech file and observe the recognition result. Apply a preset deletion pattern and note what happens to recognition performance. Then turn missing data recognition on; this ought to improve the recogniser's performance.
  2. Examine the effect of adding noise. Again, load a speech file, but this time add some noise patches with the mouse. Recognition should deteriorate (if it doesn't, you're not adding enough noise!). Then, set the display to snr > 0 from the Mask menu and switch missing data recognition on. This ensures that the system uses only those points whose SNR is positive.

References

  1. Cooke et al (1994). ICSLP, 1555-1558.
  2. Hirsch & Ehrlicher (1995). ICASSP, 153-156.
  3. Rosenthal & Okuno (1998). Computational auditory scene analysis. LEA.

Further reading


Credits etc

Produced by: Martin Cooke

Release date: June 22 1998

Permissions: This demonstration may be used and modified freely by anyone. It may be distributed in unmodified form.