Listeners are capable of perceiving speech in the face of quite severe distortions (see the distorted speech demo). Common to many such distortions are the absence of spectro-temporal regions or the presence of additive noise. Missing data occurs naturally in conditions such as telephone speech and channel fade-outs. Missing data speech recognition is a recent approach (Cooke et al., 1994) to robust ASR based on solutions to the following two subproblems:
- identification of reliable evidence; and
- adapting recognisers to handle missing and masked data.
Solutions to the first subproblem range from existing noise-reduction techniques such as spectral subtraction and local SNR estimation (Hirsch & Ehrlicher, 1995) through to more general-purpose approaches such as computational auditory scene analysis (Rosenthal & Okuno, 1998).
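As a rough illustration of the local SNR route to identifying reliable evidence, the sketch below marks time-frequency cells whose local SNR exceeds 0 dB, using a stationary-noise estimate taken from the leading frames. The variable names and the spectrogram itself are placeholders; this is not the demo's own code.

    % Sketch: derive a "reliable data" mask from a local SNR estimate.
    Y = abs(randn(32, 100)) + 1e-6;          % placeholder (channels x frames) spectrogram
    noiseEst = mean(Y(:, 1:10), 2);          % noise spectrum estimated from the first 10 frames
    snrLocal = 10 * log10(Y ./ repmat(noiseEst, 1, size(Y, 2)));
    mask = snrLocal > 0;                     % true where the cell is speech-dominated
    fprintf('%.1f%% of cells judged reliable\n', 100 * mean(mask(:)));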
One approach to the second subproblem is demonstrated here. The purpose of the demo is to allow the user to explore the effect on recognition of removing spectro-temporal regions from, or adding noise to, the speech.
It is worth noting that the underlying HMM system is very simple (single-component mixtures) and makes mistakes even without missing data.
Type 'md' at the MATLAB prompt. Use the speech submenu of the load menu (1) to select a file to load. The speech files supplied with the demo are digit strings (length 4 or 5) from the TIDigits corpus. The identity of the digit sequence is apparent from the file name.
[Aside: File names are of the form ti9o82, which signifies that this is the string "nine-oh-eight-two" from the TIDigits corpus. Recognition models exist for the digits 1-9, together with 'o' for 'oh', 'z' for 'zero' and '-' for silence. For this demonstration, recognition is performed using rather simple single-mixture HMMs with no use of derivatives, so performance is not perfect even without deletions!]
On loading, the system displays an auditorily motivated spectrogram, which is used as the sole basis for recognition (2). After a short while (longer on slow systems, but a few seconds on a PowerMac running at 200 MHz), the recognised result appears in the lower panel (3). Displayed boundaries are those obtained by tracing back through the best sequence of models in the usual way.
At this point, clicking anywhere on the spectrographic image invokes an editing function determined by the popup menu (4). For instance, in erase mode, the area under the cursor is removed from the display each time the user clicks on the spectrogram. It is worth remarking that the demo acts like a spreadsheet: any change restarts the recogniser, which updates the recognised result 'almost instantaneously'. If you are using a slow machine (defined here as one which takes more than a few seconds to perform the recognition), you may wish to untick the recalculate checkbox (5) until you want recognition to take place.
The extent of the region affected by each click is governed by the tool menu (6), which specifies the frequency extent (in channels) and the duration (in 10 ms time frames) of an edit operation.
Other edit options are add_noise, which adds noise at the level specified (relative to the signal mean) by the popup menu (10), and restore, which restores the original values. In addition, the whole display can be reset at any time by pressing the restore button (8).
Removing or restoring data has the obvious effect on the display. A status line at the top right of the display indicates the percentage of missing time-frequency regions. The visual effect of adding noise depends on the noise level relative to the local speech energy. Any noise addition changes the global SNR, which is indicated on the display. In addition, the user can set a required global SNR, and the noise background will be scaled to meet this criterion (11).
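To make the global SNR figure concrete, the sketch below scales a noise pattern so that the mixture sits at a requested global SNR. It is written in the waveform domain for simplicity, with placeholder signals and illustrative names; the demo itself works with spectro-temporal patches, so this is only the underlying arithmetic, not the demo's code.

    % Scale noise so that speech + gain*noise reaches roughly targetSNR dB.
    speech = randn(1, 16000);  noise = randn(1, 16000);   % placeholder signals
    targetSNR = 6;                                        % requested global SNR (dB)
    speechPow = mean(speech .^ 2);
    noisePow  = mean(noise .^ 2);
    gain      = sqrt(speechPow / (noisePow * 10 ^ (targetSNR / 10)));
    noisy     = speech + gain * noise;
    achieved  = 10 * log10(speechPow / mean((gain * noise) .^ 2));   % ~= targetSNR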
The manner in which the missing and masked data is recognised is controlled by the options available under the mask (7) and recognition (9) menus. The default recogniser follows the conventional strategy of treating the data as complete: regions deleted by the user are treated as having zero energy, and added noise is simply part of the data fed to the recogniser. Setting the missing data option via the recognition menu instructs the recogniser to treat certain regions as missing. In the simplest case, these regions are just those which the user has deleted from the display. However, for the case of added noise, the user can force the recogniser to treat regions with a negative local SNR as missing by selecting the snr > 0 option via the mask menu. The display is immediately updated to reveal those regions which meet this criterion.
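The essence of treating regions as missing is to evaluate each state's output probability over the reliable components only. The sketch below shows this marginalisation for a single frame and a single diagonal-Gaussian state; the names are illustrative and the demo's own recogniser may differ in detail.

    % Marginalisation sketch for one frame and one diagonal-Gaussian HMM state:
    % reliable components contribute their usual Gaussian log term; missing
    % components are integrated out and contribute nothing.
    x    = randn(32, 1);                      % observed feature vector (one frame)
    mu   = zeros(32, 1);  v = ones(32, 1);    % state mean and variance vectors
    mask = rand(32, 1) > 0.3;                 % true = reliable, false = missing
    r    = find(mask);                        % indices of reliable components
    logLik = sum(-0.5 * log(2 * pi * v(r)) - 0.5 * ((x(r) - mu(r)) .^ 2) ./ v(r));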
A further recognition option is to use just spectral peaks (or the subset of them neither missing nor masked by noise) in the recognition process. This option is obtained via the peaks option under the mask menu. Again, the display is updated to show the points upon which recognition takes place.
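One plausible reading of the peaks criterion, sketched below, keeps cells that are local maxima along the frequency axis within each frame; the demo's exact definition of a spectral peak may differ, and the spectrogram here is a placeholder.

    % Per-frame spectral-peak mask: keep cells that are local maxima in frequency.
    S = abs(randn(32, 100));                  % placeholder (channels x frames) spectrogram
    peaks = false(size(S));
    peaks(2:end-1, :) = S(2:end-1, :) > S(1:end-2, :) & S(2:end-1, :) > S(3:end, :);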
Certain aspects of the recognition models and process can be visualised (12):
- Visualise>HMMs plots the means/variances in each HMM state
- Visualise>Output probs displays the per-frame output probabilities. This is updated when changes are made to the recogniser's input.
It is possible to apply certain deletion patterns without using the mouse to select regions. These are available under the load>deletions submenu and include the following (a rough sketch of how such masks might be generated appears after the list):
- random deletion at the 50% and 90% level
- random removal of frames or channels at 50% or 90%
- removal of all low energy points
- lowpass, bandpass or highpass filtering
- removal of all but a couple of individual frequency bands
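The sketch below mirrors some of these presets in spirit only (true = keep, false = delete); the variable names are illustrative and this is not the demo's internal code.

    S = abs(randn(32, 100));                                           % placeholder spectrogram
    keepRandom   = rand(size(S)) > 0.5;                                % ~50% random deletion
    keepFrames   = repmat(rand(1, size(S, 2)) > 0.5, size(S, 1), 1);   % drop whole frames
    keepChannels = repmat(rand(size(S, 1), 1) > 0.5, 1, size(S, 2));   % drop whole channels
    keepEnergy   = S > median(S(:));                                   % remove low-energy points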
In addition, noise patterns (at present, these are just other speech files) can be loaded from the load>noise submenu.
In the near future, the demo will be enhanced to include auditory induction effects and perhaps a range of automatic local SNR estimation techniques.
- Load a speech file and observe the recognition result. Apply a preset deletion pattern and note what happens to recognition performance. Then turn missing data recognition on; this ought to improve the recogniser's performance.
- Examine the effect of adding noise. Again, load a speech file, but this time add some noise patches with the mouse. Recognition should deteriorate (if it doesn't, you're not adding enough noise!). Then set the mask to snr > 0 from the mask menu and switch missing data recognition on. This ensures that the system uses only those points whose local SNR is positive.
- Cooke et al. (1994). Proc. ICSLP, 1555-1558.
- Hirsch & Ehrlicher (1995). Proc. ICASSP, 153-156.
- Rosenthal & Okuno (Eds.) (1998). Computational Auditory Scene Analysis. Lawrence Erlbaum Associates.
- An easy introduction to the technique is given in Cooke et al. (1997), Proc. ICASSP, 863-866. For more recent research updates, see Cooke's home page.
Produced by: Martin Cooke
Release date: June 22 1998
Permissions: This demonstration may be used and modified freely by anyone. It may be distributed in unmodified form.