An important problem in speech processing is detecting the presence of speech against a background of noise. This is often referred to as the endpoint location problem [1]. Accurate detection of a word's start and end points means that subsequent processing of the data can be kept to a minimum. Consider speech recognition based on template matching. An utterance will generally differ from the template in both its exact timing and its duration, so the two must be aligned before they can be compared. In many cases the accuracy of this alignment depends on the accuracy of the endpoint detection.
In order to perform well, the algorithm must take a number of special situations into account such as:
- Words which begin or end with low-energy phonemes (weak fricatives).
- Words which end with an unvoiced plosive.
- Words which end with a nasal.
- Speakers ending words with a trailing off in intensity or a short breath (noise).
The method proposed in [1] and used in this demonstration relies on two measures of the signal: the zero-crossing rate and the energy.
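The two measures can be sketched in a few lines of Python. This is only an illustration, not the demo's own code: the function name `frame_measures`, the use of non-overlapping frames, and the choice of the sum of magnitudes as the "energy" are all assumptions made here (a sum of squares would serve equally well).

```python
def frame_measures(samples, frame_len):
    """Return (energies, crossings), one value per non-overlapping frame.

    Assumption: 'energy' is taken as the sum of sample magnitudes and
    the zero-crossing measure is the count of sign changes in the frame.
    """
    energies, crossings = [], []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        # Magnitude-based short-time energy for this frame.
        energies.append(sum(abs(s) for s in frame))
        # Count sign changes between consecutive samples.
        crossings.append(
            sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
        )
    return energies, crossings
```

With a 10ms frame, `frame_len` would be the sampling rate divided by 100, for example 160 samples at 16 kHz.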
Three thresholds are computed:
- ITU - Upper energy threshold.
- ITL - Lower energy threshold.
- IZCT - Zero-crossing rate threshold.
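As a rough guide, the thresholds can be estimated from the peak energy and from the statistics of an initial stretch of silence. The sketch below follows the rules as they are commonly summarised from [1]; the constants (0.03, 4, 5, and a cap of 25 crossings per 10ms frame) are quoted from memory and should be checked against the paper, and the function and parameter names are hypothetical.

```python
from statistics import mean, pstdev

def thresholds(energies, crossings, n_silence, zc_cap=25):
    """Return (ITU, ITL, IZCT) from per-frame energies and crossings.

    Assumption: the first n_silence frames contain background noise only.
    The constants are recalled from Rabiner and Sambur [1], not verified.
    """
    silence_energy = mean(energies[:n_silence])   # background energy level
    peak_energy = max(energies)                   # peak energy of the signal
    i1 = 0.03 * (peak_energy - silence_energy) + silence_energy
    i2 = 4.0 * silence_energy
    itl = min(i1, i2)          # lower energy threshold
    itu = 5.0 * itl            # upper energy threshold
    silence_zc = crossings[:n_silence]
    # Mean plus two standard deviations of the silence zero-crossing
    # counts, capped at a fixed value.
    izct = min(zc_cap, mean(silence_zc) + 2.0 * pstdev(silence_zc))
    return itu, itl, izct
```

With 10ms frames, `n_silence=10` would correspond to the first 100ms of the signal.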
For more information on how these are computed, see Rabiner and Sambur [1].
The method proceeds as follows. Search from the beginning of the signal until the energy crosses ITU. Then back off towards the beginning of the signal until reaching the first point at which the energy falls below ITL. This is the provisional beginning point, N1. The provisional end point, N2, is selected in a similar way, searching from the end of the signal.
Next, examine the zero-crossing rate over the 250ms preceding N1. If it exceeds IZCT three or more times, N1 is moved back to the first point at which IZCT is exceeded; otherwise N1 is left where it is. The result is the final beginning point. A similar procedure is applied to refine the end point N2.
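The search described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the demo's implementation: `energies` and `crossings` are per-frame lists, frames are assumed to be 10ms long (so 25 frames cover 250ms), and the names `find_start` and `find_end` are hypothetical. The end point is found by running the same search on the time-reversed measures.

```python
def find_start(energies, crossings, itu, itl, izct, lookback=25):
    """Return the frame index of the beginning point N1, or None."""
    # Step 1: first frame whose energy exceeds the upper threshold ITU.
    peak = next((i for i, e in enumerate(energies) if e > itu), None)
    if peak is None:
        return None  # the energy never crosses ITU: no speech found
    # Step 2: back off towards the start until the energy is below ITL.
    n1 = peak
    while n1 > 0 and energies[n1] >= itl:
        n1 -= 1
    # Step 3: inspect the zero-crossing measure in the frames before N1.
    window = range(max(0, n1 - lookback), n1)
    hits = [i for i in window if crossings[i] > izct]
    if len(hits) >= 3:
        n1 = hits[0]  # move N1 to the earliest frame exceeding IZCT
    return n1

def find_end(energies, crossings, itu, itl, izct, lookback=25):
    """Return the frame index of the end point N2 by mirroring the search."""
    mirrored = find_start(energies[::-1], crossings[::-1],
                          itu, itl, izct, lookback)
    return None if mirrored is None else len(energies) - 1 - mirrored
```

Multiplying a returned frame index by the frame duration converts it to a time in the signal.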
For a more detailed explanation of the algorithm, refer to Rabiner and Sambur [1].
Note: For the algorithm to perform correctly, the first 100ms of the signal must contain no speech, because the thresholds are estimated from the background noise in this interval.
Type 'epd' to launch the demo. When the window appears, use the load menu (1) to load a sound file. The signal can be played by clicking anywhere within the signal axes. The pair of cursors allows the user to mark their own estimate of the endpoints, and the selected portion of the signal can optionally be played when the cursors are released (3). To the right of the axes (2) are three toggle buttons. The top two, Show Plots and Show Thresholds, provide the user with clues: the former displays the energy and zero-crossing rate, and the latter adds the thresholds ITU, ITL and IZCT. With both of these buttons pressed, it is simple to find the endpoints. Pressing the third button, Show Endpoints, displays the computed endpoints in magenta on the signal axes.
The buttons can only be pressed in the order of increasing information displayed.
- Do your estimates always match well with the computed endpoints? If not, why do you think this happens? (Consider the special cases outlined above.)
[1] Rabiner, L.R. and Sambur, M.R., "An Algorithm for Determining the Endpoints of Isolated Utterances". The Bell System Technical Journal, Vol. 54, No. 2, February 1975, pp. 297-315.
Produced by: Stuart N Wrigley
Release date: January 20 1999
Permissions: This demonstration may be used and modified freely by anyone. It may be distributed in unmodified form.