Auditory scene analysis: Listening to several things at once

".. hearing every sound: the rustle of the leaves, the clear splashing of the spring, the whistling of the birds, the crack of dry twigs under the tread of unseen animals. And farther away he heard the whispering of the wind through the dry grass ... while golden eagles soared, crying, high above the crags. He listened to this multifarious harmony, and amidst all those sounds and noises, he heard the song of a child."
Hans Bemmann

In most listening environments, a mixture of sounds reaches our ears. For example, at a crowded party there are many competing voices and other interfering noises, such as music. Similarly, the sound of an orchestra consists of a number of melodic lines that are played by a variety of instruments. Nonetheless, we are able to attend to a particular voice or a particular musical instrument in these situations. How does the ear achieve this apparently effortless segregation of concurrent sounds?

This phenomenon was noted by E. C. Cherry in 1953, who called it the `cocktail party problem'. Since then, the perceptual segregation of concurrent sounds has been the subject of extensive psychological research. In 1990, an influential account of this work was published by Albert Bregman of McGill University, Montreal. In his book, Bregman contends that the mixture of sounds reaching the ears is subjected to a two-stage auditory scene analysis (ASA). In the first stage, the acoustic signal is decomposed into a number of `sensory elements'. Subsequently, elements that are likely to have originated from the same environmental source are grouped into perceptual structures that can be interpreted by higher-level processes (such as those involved in speech understanding).

The Speech and Hearing Research Group at Sheffield was one of the first to investigate ASA from a computer modelling perspective and has been instrumental in establishing the field of computational auditory scene analysis (CASA). Computer modelling studies have intrinsic scientific merits, in that they help to clarify the information-processing problems involved in auditory perception. Additionally, CASA has a number of practical applications. The performance of automatic speech recognition systems is currently rather poor in the presence of background noise, yet human listeners are quite capable of following a conversation in a noisy environment. This suggests that models of auditory processing could provide a robust front-end for automatic speech recognition systems. Similarly, some human listeners with abnormal hearing have difficulty in understanding speech in noisy environments. These listeners may be helped by an `intelligent' hearing aid that can attenuate noises, echoes and the sounds of competing talkers, while amplifying a target voice.

Auditory scene analysis

Bregman makes a distinction between two types of auditory grouping; namely primitive grouping and schema-driven grouping. Primitive grouping is driven by the incoming acoustic data, and is probably innate. In contrast, schema-driven grouping employs the knowledge of familiar patterns and concepts that have been acquired through experience of acoustic environments.

Many primitive grouping principles can be described by the Gestalt principles of perceptual organisation. The Gestalt psychologists proposed a number of rules governing the manner in which the brain forms mental patterns from elements of its sensory input. Although these principles were first described in relation to vision, they are equally applicable to audition. For example, the Gestalt principle of closure refers to a tendency to complete (close) perceptual forms. We should expect the same principle to apply in audition, since sounds are regularly masked by other sounds. Indeed, it has been shown that listeners are able to perceptually restore parts of a quieter sound that have been masked by a louder sound, a process known as auditory induction.

Another potent Gestalt principle is `common fate', which states that elements changing in the same way at the same time probably belong together. There is good evidence that the auditory system exploits common fate by grouping acoustic components that exhibit changes in amplitude at the same time. Similarly, grouping by harmonicity can be phrased in terms of the Gestalt principle of common fate. When a person speaks, the vibration of their vocal cords generates energy at the fundamental frequency of vibration and also at integer multiples (harmonics) of this frequency. Hence, the components of a single voice can be grouped by identifying acoustic components that have a common spacing in frequency (i.e. harmonics of the same fundamental).
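As a toy illustration of harmonicity grouping (this is a sketch of the general idea, not any of the systems described here; the function name and tolerance are illustrative choices), spectral components can be assigned to whichever candidate fundamental best explains them as integer harmonics:

```python
def group_by_harmonicity(freqs, f0_candidates, tol=0.03):
    """Assign each component frequency (Hz) to the first candidate
    fundamental for which it lies close to an integer harmonic.
    `tol` is the allowed deviation in harmonic number.
    Returns a dictionary mapping each fundamental to its components."""
    groups = {f0: [] for f0 in f0_candidates}
    for f in freqs:
        for f0 in f0_candidates:
            ratio = f / f0
            if round(ratio) >= 1 and abs(ratio - round(ratio)) < tol:
                groups[f0].append(f)
                break
    return groups

# Two interleaved 'voices': harmonics of 100 Hz and of 130 Hz.
mixture = [100, 130, 200, 260, 300, 390, 400, 520]
groups = group_by_harmonicity(mixture, [100, 130])
print(groups)  # components separated into the two harmonic series
```

A real system must of course also estimate the candidate fundamentals from the signal itself, for example from a correlogram as described below.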

Auditory representations

The peripheral auditory system - consisting of the outer, middle and inner ear - serves to transform acoustic energy into a neural code in the auditory nerve. Individual auditory nerve fibres are frequency-tuned; that is, each fibre will respond maximally (at its highest firing rate) when presented with a tone of a specific frequency. To a first approximation, then, the auditory periphery is often considered to act as a bank of bandpass filters with overlapping pass bands. One way to characterise the auditory code is in terms of average firing rate across the range of frequencies represented by the fibres; a so-called rate representation.

Recent studies suggest that a rate characterisation cannot explain the perception of everyday sound sources, since the rate response of fibres saturates at moderate sound levels, leading to a loss of contrast in the firing patterns. However, auditory nerve fibres show a tendency to fire in phase with individual frequency components present in a stimulus. This `phase locked' response could be used to provide information about which frequency components are dominant, even at high sound levels when the rate response has saturated. So, the auditory system may also use a timing representation.

One important representation of auditory temporal activity is the correlogram. Work at Sheffield has investigated the use of correlograms for separating concurrent sounds according to fundamental frequency. A correlogram is formed by computing a running autocorrelation of auditory nerve firing activity in each band of an auditory periphery model. The result is a two-dimensional representation, in which frequency and time lag (which corresponds to pitch period) are displayed on orthogonal axes.
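A minimal sketch of the correlogram computation follows. Here half-wave-rectified sinusoids stand in for the firing probability in each channel of a periphery model, and an unnormalised short-term autocorrelation stands in for the running autocorrelation; both are simplifications for illustration:

```python
import math

def autocorr(x, max_lag):
    """Short-term autocorrelation of one channel's activity."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag)]

def correlogram(channels, max_lag):
    """2-D correlogram: rows are frequency channels, columns are
    autocorrelation lags (lag corresponds to pitch period)."""
    return [autocorr(ch, max_lag) for ch in channels]

# Simulated firing activity: channels driven by harmonics 1-3 of a
# 100 Hz fundamental (period = 80 samples at fs = 8000 Hz).
fs, f0 = 8000, 100
channels = [[max(math.sin(2 * math.pi * f0 * h * n / fs), 0.0)
             for n in range(800)] for h in (1, 2, 3)]
cg = correlogram(channels, 200)

# Summing across frequency and locating the largest peak at a
# nonzero lag recovers the common pitch period.
summary = [sum(row[lag] for row in cg) for lag in range(200)]
best_lag = max(range(20, 200), key=lambda lag: summary[lag])
print(best_lag)  # 80 samples, i.e. 10 ms, the period of 100 Hz
```

Channels dominated by harmonics of the same fundamental all show a peak at the fundamental's period, which is why the summary across frequency provides a basis for separating concurrent sounds by fundamental frequency.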

The correlogram is one example of an auditory map. Throughout the higher auditory system, it appears that important acoustic parameters are represented in two- or three-dimensional arrays of cells. The value of the parameter at a particular frequency is indicated by the firing rate of the cell at the appropriate position in the neural array. We have developed computational models of several maps, including those that code frequency modulation (FM), amplitude modulation (AM), onset of energy and offset of energy.
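By way of illustration, one of the simplest such maps, an onset map, can be sketched as cells that fire when the energy in their channel rises sharply. The detection scheme below (half-wave-rectified first difference against a threshold) is a deliberately minimal stand-in for the physiologically motivated onset cells in our models:

```python
def onset_map(envelopes, threshold=0.1):
    """For each channel's energy envelope, mark the sample indices at
    which energy rises by more than `threshold` in one step
    (half-wave-rectified first difference)."""
    result = []
    for env in envelopes:
        diff = [max(env[n] - env[n - 1], 0.0) for n in range(1, len(env))]
        result.append([n + 1 for n, d in enumerate(diff) if d > threshold])
    return result

# Two channels whose energy switches on at different times.
ch1 = [0.0] * 10 + [1.0] * 10
ch2 = [0.0] * 15 + [1.0] * 5
print(onset_map([ch1, ch2]))  # [[10], [15]]
```

Common onset time is itself a common-fate cue: components that start together in such a map are candidates for grouping into the same source.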

Computational architectures for auditory grouping

Having obtained an auditory representation of a sound mixture, it is necessary to decide which elements in the auditory scene belong together, and which do not. One approach to this problem is a symbolic architecture. In this approach, the auditory scene is represented as a collection of discrete objects (which might correspond to harmonics or formants), and Gestalt grouping principles are used to partition these objects into groups. Resynthesis can be used to listen to one component of a mixture that has been isolated by the segregation process. A system based on so-called synchrony strands, developed at Sheffield, has proven to be a very influential symbolic architecture for CASA. In this system, the underlying representation is derived from analysis of the instantaneous frequency of auditory filter activity.

More recently, we have been investigating architectures for auditory grouping that have a closer relationship with the known physiology of the auditory system. How are groups of acoustic features actually represented in the brain? One possibility is that temporal synchronisation is used to code features that belong to the same acoustic source. In this scheme, neurons that code features of the same source have a synchronised firing response (phase-locked with zero phase lag), which is desynchronised from that of neurons coding features of different sources. In particular, we have been researching how networks of neurons with an oscillatory firing behaviour - so-called neural oscillators - can represent auditory grouping.
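The principle can be illustrated with a toy simulation of coupled phase oscillators (a simplification for illustration only; the networks studied here use relaxation oscillators, not the Kuramoto-style phase model below, and all coupling values are arbitrary). Oscillators coding the same source are coupled excitatorily and pull into synchrony, while inhibitory coupling between sources keeps their phases apart:

```python
import math

def simulate(phases, coupling, steps=2000, dt=0.01, omega=2.0):
    """Phase-oscillator update: dphi_i/dt = omega
    + sum_j K[i][j] * sin(phi_j - phi_i), integrated by Euler steps."""
    p = list(phases)
    for _ in range(steps):
        dp = [omega + sum(coupling[i][j] * math.sin(p[j] - p[i])
                          for j in range(len(p)))
              for i in range(len(p))]
        p = [(pi + dt * dpi) % (2 * math.pi) for pi, dpi in zip(p, dp)]
    return p

# Oscillators 0,1 code features of source A; 2,3 code source B.
# Within-source coupling is excitatory (+1), between-source
# coupling is inhibitory (-0.2).
K = [[0, 1, -0.2, -0.2],
     [1, 0, -0.2, -0.2],
     [-0.2, -0.2, 0, 1],
     [-0.2, -0.2, 1, 0]]
final = simulate([0.0, 1.0, 3.0, 4.0], K)

def gap(a, b):
    """Smallest angular distance between two phases."""
    d = abs(a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)

print(gap(final[0], final[1]) < 0.1)  # within-source pair synchronised
print(gap(final[0], final[2]) > 0.5)  # between-source phases stay apart
```

The grouping is thus read out from the temporal pattern of activity itself, rather than from an explicit symbolic data structure.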

The performance of neural oscillator models of auditory organisation is similar to that of their symbolic counterparts. However, the neural oscillator approach has an important advantage - it is better suited to implementation in hardware, such as analogue VLSI. Clearly, real-time performance is required if CASA is to be successfully applied in automatic speech recognition devices and advanced hearing prostheses.

Key publications
  • M. P. Cooke (1994) Modelling auditory processing and organisation. Cambridge University Press.
  • G. J. Brown and M. P. Cooke (1994) Computational auditory scene analysis. Computer Speech and Language, 8, pp. 297-336.
  • G. J. Brown and D. Wang (1997) Modelling the perceptual segregation of concurrent vowels with a network of neural oscillators. Neural Networks, 10 (9), pp. 1547-1558.
  • D. L. Wang and G. J. Brown (1999) Separation of speech from interfering sounds using oscillatory correlation. IEEE Transactions on Neural Networks, 10 (3), pp. 684-697.