Inference of Optimal Auditory Representations

Principal Investigator: Steve Renals

Funded by the Nuffield Foundation from 1 June 1995

The construction of acoustic models for speech recognition typically involves building neural network or statistical models of words or phonemes that reach down to a parameterised representation of the speech signal, usually derived from a spectral analysis. Such analyses produce a compact representation (the typical data rate is a few kilobytes per second), but in so doing lose much information that may be important for recognition in noisy or cluttered acoustic environments. This proposal is concerned with building statistical models using the detailed time domain information that may be obtained from models of auditory processing.

The auditory model to be investigated is the Auditory Image Model (AIM) introduced by Patterson et al. (1994). The auditory image is based on a computational model of the inner ear that models the cochlea as a bank of ``gammatone'' filters, followed by a two-dimensional adaptation component that acts as a transducer, mapping the simulated cochlear motion into a multi-channel pattern of activity representing the neural spike train flowing from the cochlea to the cochlear nucleus. This spike train is an unstable or periodic pattern. The auditory image stabilises the representation by carrying out an adaptive temporal integration process. This periodicity-sensitive temporal integration is triggered by spike train activity in each channel, and results in a stable auditory image that does not smear out the temporal fine structure (unlike a non-adaptive, moving average form of integration).
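As a rough illustration of the first stage described above, the sketch below generates gammatone impulse responses in plain Python. It is not taken from the AIM implementation: the filter order of 4, the bandwidth factor 1.019, the ERB approximation, the 16 kHz sample rate, the logarithmic spacing of centre frequencies and all function names are assumptions chosen for illustration.

```python
import math

def erb_hz(centre_hz):
    """A widely used approximation to the equivalent rectangular bandwidth."""
    return 24.7 * (4.37 * centre_hz / 1000.0 + 1.0)

def gammatone_ir(centre_hz, sample_rate_hz=16000, duration_s=0.025, order=4):
    """Peak-normalised gammatone impulse response:
    t**(order-1) * exp(-2*pi*b*t) * cos(2*pi*fc*t)."""
    b = 1.019 * erb_hz(centre_hz)  # assumed bandwidth scaling
    n = round(duration_s * sample_rate_hz)
    ir = []
    for k in range(n):
        t = k / sample_rate_hz
        envelope = t ** (order - 1) * math.exp(-2.0 * math.pi * b * t)
        ir.append(envelope * math.cos(2.0 * math.pi * centre_hz * t))
    peak = max(abs(v) for v in ir)
    return [v / peak for v in ir]

def gammatone_bank(num_channels=100, low_hz=100.0, high_hz=6000.0):
    """Impulse responses for log-spaced centre frequencies (assumed spacing)."""
    bank = []
    for i in range(num_channels):
        frac = i / (num_channels - 1) if num_channels > 1 else 0.0
        centre = low_hz * (high_hz / low_hz) ** frac
        bank.append(gammatone_ir(centre))
    return bank

bank = gammatone_bank(num_channels=4)
print(len(bank), "channels,", len(bank[0]), "samples per impulse response")
```

Convolving a speech waveform with each impulse response would give one channel of simulated cochlear motion; the adaptation and temporal integration stages are not sketched here.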

The cost of the AIM is an increased data rate. A typical auditory image may use 100 channels, with several hundred time domain samples in each channel. With a new image computed every 20 ms (for example), this results in a data rate three orders of magnitude larger than that of currently used front ends. Preliminary investigations in which the AIM has been used for speech recognition have reduced the data by averaging and downsampling, thus losing much of the information contained in the fine structure of the image. The goal of this research is to develop methods that can substantially reduce the data rate, while retaining information contained in the fine structure.
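The size of the gap can be checked with a back-of-envelope calculation. In the sketch below, the channel count and the 20 ms image period come from the text; the figure of 300 samples per channel (for ``several hundred'') and the conventional front end of 13 coefficients every 10 ms are assumptions made only for illustration.

```python
import math

# From the text: 100 channels, one image every 20 ms (50 images per second).
CHANNELS = 100
IMAGES_PER_SECOND = 50
# Assumed value standing in for "several hundred" samples per channel.
SAMPLES_PER_CHANNEL = 300

# Assumed conventional spectral front end: 13 coefficients, 100 frames/s.
CONVENTIONAL_COEFFS = 13
CONVENTIONAL_FRAMES_PER_SECOND = 100

aim_rate = CHANNELS * SAMPLES_PER_CHANNEL * IMAGES_PER_SECOND
conventional_rate = CONVENTIONAL_COEFFS * CONVENTIONAL_FRAMES_PER_SECOND
ratio = aim_rate / conventional_rate

print(f"auditory image rate: {aim_rate} values per second")
print(f"conventional rate:   {conventional_rate} values per second")
print(f"ratio: {ratio:.0f} (about 10^{math.log10(ratio):.1f})")
```

Under these assumptions the auditory image stream carries 1.5 million values per second against 1300 for the conventional front end, a ratio of a little over one thousand.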

Although the proposal is directed towards developing {\it automatic} methods, it is crucial that any prior knowledge about the problem is used correctly. Auditory images have a good deal of structure, which we may separate into structure within an image and structure across successive images. An individual auditory image possesses two basic types of structure, horizontal and vertical. The horizontal structure of an image corresponds to the activity in each individual channel, whereas the vertical structure contains pitch information. There is a great deal of continuity between images: the breakdown of this continuity signals some kind of auditory transition (an ``auditory event'' in the language of Morgan et al., 1994). Here, we propose to model both the ``static'' and ``dynamic'' structure of auditory images.

Static Structure: We may regard the development of lower dimensional representations of auditory images as a coding problem. Simple methods that do not capture the fine structure include averaging and principal components analysis, which seeks projections in the directions of maximum variance. We propose to investigate more powerful nonlinear techniques. In neural population coding, the inferred representation is given by the hidden units of a neural network autoencoder (Zemel and Hinton, 1994). In addition to having a location in the input image space, each hidden unit has a location in a lower dimensional {\it implicit space}. A possible implicit space for auditory images might be pitch, loudness, timbre and rhythm. In the proposed method, however, only the dimensionality of the implicit space is specified; its basis is inferred from the data, along with the other parameters. The prior information about the structure of auditory images can be incorporated by developing an appropriate prior probability distribution that reflects the horizontal and vertical structure. This prior is then applied as a regularisation term during inference.
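For comparison with the nonlinear methods proposed above, the linear baseline can be sketched briefly. The code below estimates the first principal component (the direction of maximum variance) of a set of flattened images by power iteration, using only the Python standard library. The function names, the toy data and the use of power iteration are illustrative assumptions; this is the simple baseline, not the population coding method.

```python
import math
import random

def first_principal_component(images, iterations=100, seed=0):
    """Estimate the direction of maximum variance by power iteration.

    `images` is a list of equal-length flattened images (lists of floats).
    Returns (mean, direction), where direction has unit length.
    """
    n, d = len(images), len(images[0])
    mean = [sum(img[j] for img in images) / n for j in range(d)]
    centred = [[img[j] - mean[j] for j in range(d)] for img in images]

    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    for _ in range(iterations):
        # Multiply by the unnormalised covariance matrix: w = X^T (X v).
        scores = [sum(x[j] * v[j] for j in range(d)) for x in centred]
        w = [sum(scores[i] * centred[i][j] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    return mean, v

def project(image, mean, direction):
    """One-dimensional code for an image: its coordinate along `direction`."""
    return sum((image[j] - mean[j]) * direction[j] for j in range(len(image)))

# Toy data: three-pixel "images" that vary mainly along the (1, 1, 0) direction.
toy_images = [[float(i), float(i) + 0.1 * (-1) ** i, 0.0] for i in range(10)]
toy_mean, toy_direction = first_principal_component(toy_images)
toy_codes = [project(img, toy_mean, toy_direction) for img in toy_images]
print([round(c, 3) for c in toy_direction])
```

Each image is reduced to a single number here. The small alternating detail in the second pixel is discarded by the projection, which is the sense in which such methods lose fine structure.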

Dynamic Structure: The structure of a sequence of auditory images may be modelled by neural networks trained as predictors. When the prediction breaks down (that is, has a large error), this indicates that a significant auditory transition has occurred. We may use this information to produce a variable image rate sequence, in which images that can be predicted from the previous image are discounted as providing no new information. This approach is related to the stochastic event-based model of Morgan et al. (1994).
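The selection rule described above can be sketched as follows. The proposal envisages a trained neural network as the predictor; here a trivial persistence predictor (the next image is predicted to equal the previous one) stands in for it, and the mean squared error measure, the threshold and the function names are assumptions made for illustration.

```python
def mean_squared_error(a, b):
    """Mean squared difference between two equal-length flattened images."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def select_informative_images(images, threshold, predict=None):
    """Return the indices of images that the predictor fails to anticipate.

    `predict` maps the previous image to a predicted next image. The default
    is a persistence predictor, a stand-in for a trained neural network.
    The first image is always kept because nothing precedes it.
    """
    if predict is None:
        predict = lambda previous: previous
    kept = [0] if images else []
    for i in range(1, len(images)):
        error = mean_squared_error(predict(images[i - 1]), images[i])
        if error > threshold:
            kept.append(i)  # large prediction error: treat as a transition
    return kept

# Toy sequence of two-pixel images with transitions at indices 2 and 5.
sequence = [[0.0, 0.0], [0.0, 0.0], [1.0, 1.0],
            [1.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
print(select_informative_images(sequence, threshold=0.5))
```

Only the retained indices would be passed on to the recogniser, giving a variable image rate that is high around transitions and low during steady stretches.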

The proposed research is highly novel. Detailed auditory models have not been used in speech recognition due to the high data rate that is seemingly inherent in their detailed structure. It is proposed to investigate recently reported nonlinear methods to perform a reduction in data rate, while retaining information contained in the structure of auditory images. If successful, this research could have a substantial impact in widening the scope of speech recognition to noisy situations and situations with multiple sources.