SpandH seminar abstracts 2008
10 December 2008
LabROSA, Columbia University, New York, USA
Model-based EM Source Separation and Localization in Reverberant Mixtures
This talk will describe our system for separating and localizing multiple sound sources from a reverberant two-channel recording. The talk begins with a characterization of the interaural spectrogram for single source recordings, and a method for constructing a probabilistic model of interaural parameters that can be evaluated at individual spectrogram points. Multiple models can then be combined into a mixture model of sources and delays, which reduces the multi-source localization problem to a collection of single source problems. The talk will then outline an expectation maximization algorithm for finding the maximum-likelihood parameters of this mixture model, and show that these parameters correspond well with interaural parameters measured in isolation. As a byproduct of fitting this model, the algorithm creates probabilistic spectrogram masks that can be used for source separation. In experiments performed in simulated anechoic and reverberant environments, our system on average improved the direct-path signal-to-noise ratio of target sources by 2.1 dB more than two comparable algorithms.
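The link between fitting a mixture model with EM and obtaining probabilistic masks can be illustrated with a deliberately simplified sketch. It is not the system described in the abstract: it assumes a single scalar interaural cue per spectrogram point and one Gaussian per source, and the function names, the initial means and the `init_var` and `var_floor` parameters are hypothetical. The per-point posteriors computed in the E-step play the role of the soft masks.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_soft_masks(cues, init_means, n_iter=50, init_var=0.05, var_floor=1e-4):
    """Fit a 1-D Gaussian mixture to per-point cues with EM.

    Returns (means, variances, weights, posteriors), where posteriors[i][j]
    is the probability that point i belongs to source j (a soft mask value).
    All parameter names and defaults are illustrative assumptions.
    """
    k = len(init_means)
    means = list(init_means)
    variances = [init_var] * k
    weights = [1.0 / k] * k
    posteriors = []
    for _ in range(n_iter):
        # E-step: posterior responsibility of each source for each point.
        posteriors = []
        for x in cues:
            likes = [weights[j] * gaussian_pdf(x, means[j], variances[j])
                     for j in range(k)]
            total = sum(likes)
            if total == 0.0:
                # Numerical underflow for every source: fall back to uniform.
                posteriors.append([1.0 / k] * k)
            else:
                posteriors.append([like / total for like in likes])
        # M-step: re-estimate each source's weight, mean and variance.
        for j in range(k):
            resp = sum(p[j] for p in posteriors)
            if resp < 1e-12:
                continue  # leave an unused component unchanged
            weights[j] = resp / len(cues)
            means[j] = sum(p[j] * x for p, x in zip(posteriors, cues)) / resp
            var = sum(p[j] * (x - means[j]) ** 2
                      for p, x in zip(posteriors, cues)) / resp
            variances[j] = max(var_floor, var)
    return means, variances, weights, posteriors
```

Multiplying a spectrogram point by its posterior for a given source would then give a soft-masked estimate of that source; the variance floor only guards against a component collapsing onto a single point.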
3 December 2008
Robert P. Morse
School of Life and Health Sciences, Aston University
Stochastic neural coding and implications for cochlear implant design
Randomness in a system can have many benefits, and researchers realized in the early 1990s that Nature may have been taking advantage of these benefits for millions of years - an area of research associated with stochastic resonance and the periodicity of the ice ages. Sensory systems such as the hair cells in the ear are notoriously noisy, and the electrical signals resulting from the sound signals are swamped by random fluctuations, caused for example by the Brownian motion of the stereocilia. Auditory scientists often consider this noise a hindrance, but the results from stochastic resonance suggest that it might actually be an essential part of normal coding. The hair cells, however, are physiologically very vulnerable and in many forms of profound deafness they have been completely destroyed, so the sources of noise present in normal hearing are absent in the deafened ear. We have therefore advocated that cochlear implants may be improved by deliberately adding noise to the signals that go to the electrodes. In computational studies we have shown that the addition of noise to cochlear implant signals results in nerve activity that more closely resembles that evoked by acoustic stimulation; moreover, the amount of information transmitted by the cochlear nerve can be increased.
In this talk I will review the reasons why noise may be beneficial for cochlear implants and the technical problems that must be overcome to make the method practical. I will also show that another form of noise present in normal hearing can theoretically lead to the cochlear nerve transmitting even more information than noise that simply adds to the signal, as in standard stochastic resonance.
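The basic stochastic resonance effect can be illustrated with a toy threshold detector. This is a minimal sketch rather than the speaker's nerve model: the sine-wave signal, the fixed threshold and every parameter value are assumptions chosen for illustration. Without noise the subthreshold signal never crosses the threshold; with added Gaussian noise, crossings occur and cluster in the positive half-cycles, so the output carries information about the signal.

```python
import math
import random


def threshold_crossings(noise_sd, n_samples=20000, amplitude=0.5,
                        threshold=1.0, period=100, seed=0):
    """Count threshold crossings of a noisy subthreshold sine wave.

    Returns (pos_count, neg_count): crossings that occur while the
    underlying signal is positive and negative respectively.
    All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    pos_count = 0
    neg_count = 0
    for n in range(n_samples):
        signal = amplitude * math.sin(2.0 * math.pi * n / period)
        noise = rng.gauss(0.0, noise_sd) if noise_sd > 0 else 0.0
        if signal + noise > threshold:
            if signal > 0:
                pos_count += 1
            elif signal < 0:
                neg_count += 1
    return pos_count, neg_count
```

Comparing `threshold_crossings(0.0)` with `threshold_crossings(0.5)` shows the contrast: the first produces no crossings at all, while in the second the crossings are concentrated where the signal is high.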
26 November 2008
DCS, University of Sheffield
An animatronic tongue and vocal tract: AnTon
Today's speech and language technology is mainly driven by statistical models that are trained using large databases of speech. Since these databases usually consist of recorded speech sounds and since there is no one-to-one relation between speech sounds and articulatory configurations, it can be argued that this modelling process is rather disconnected from underlying speech production processes. In order to investigate human speech production in detail, and thereby provide data for future improvements of speech technology, a bio-inspired robotic speaker was designed. 'AnTon' stands for "Animatronic Tongue and Vocal Tract" and it is unique in that it mimics human anatomy rather than functionality. This means that the model's behaviour can be mapped directly to human speech articulation, provided that it is carefully calibrated and evaluated. The present state of AnTon's development will be presented, along with an outlook on the near future of the project.
19 November 2008
DCS, University of Sheffield
The influence of audio presentation style on multitasking during teleconferences
Teleconference participants often multitask: they work on a text-based 'foreground' task whilst listening in the 'background' for an item of interest to appear. Audio material should therefore be presented in a manner that has the smallest possible impact on the foreground task without affecting topic detection. Here, we ask whether dichotic or spatialised audio presentation of a meeting is less disruptive than the single-channel mixture of talkers normally used in teleconference audio. A number of talker location configurations are used, and we examine how these impact upon a text-based foreground task: finding all letter 'e' occurrences in a block of text. Additionally, we examine the effect of cueing the listener to direction or gender and record listener preferences for audio presentation style. Our results suggest that spatialised audio disrupts the foreground task less than single-channel audio when direction or gender is cued. We conclude by describing future work and, specifically, the design of a new set of experiments.
12 November 2008
Queen's University Belfast, UK
The road to conversation with a computer
Conversation is not simply taking turn about to transmit messages. The explicit messages that people exchange are embedded in interchanges at multiple other levels. That is well known in principle, but the details remain frustratingly incomplete in various different ways. We became interested in forming an inclusive picture during the 1980s, through work on sociolinguistics, and have returned to it recently in a project that aims to achieve sustained conversation between a human and a computer, called SEMAINE. We hope to carry it forward in a new project on social signal processing beginning next year.
It would be useful if new technological efforts could access a description of conversation that was rich enough to pre-empt unhelpful simplifications.
Conversation can be thought of as transmitting several kinds of information. Explicit messages are the most obvious element. Linguistics has stressed that control information (about continuing, yielding the floor, and so on) is also needed. Sociolinguistics stressed that information about enduring personal qualities (background, affiliations, etc) is also routinely transmitted. Other types of enduring personal quality, such as personality, are also signalled (not necessarily reliably). We have been part of a recent surge in interest in shorter term personal conditions, most obviously emotion, but also stances (friendly, impatient), epistemic states (interested, sure), and what might be called concerns (the things that matter to the person in the current situation).
The transmission will typically be in a context where certain ends are to be achieved. Pure interchange of information may be the goal, but others are very common. Social goals such as defining status, getting acquainted, and forming bonds are pervasive. Practical goals include managing joint action and persuading. Limiting exchange, by concealing, deceiving or escaping, is extremely important. Conversation may also be a means of entertainment, or simply passing time.
The medium of transmission is a separate issue (one that is often concealed by the term 'nonverbal communication'). Most types of information can be transmitted via multiple media: words, mode of speech, facial signals, and gestures. The transmission may be at various levels - item selection (what you say or do), item execution (how you say or do it), and sequence or interspersion (in what context you say or do it). The execution of almost any action can become communicative: walking, standing, or knocking at a door.
Reading these signals is also a multi-level problem. There is a machine perception task of detecting significant classes of event. Sometimes there are univocal translation rules from a class of event to a meaning. But decoding is often more akin to abduction: seeking out a satisfactory explanation of the signs.
Some of these issues have been studied within linguistic paradigms. Others are difficult to address without large databases and sophisticated signal processing. Connecting the paradigms is an interesting challenge in itself.
Clearly human-computer interaction will use simplified models of conversation for the foreseeable future. But it is intellectually more satisfying to have an understanding of the simplifications that are being made, and there are situations where it may be practically important.
5 November 2008
Speech, Hearing and Phonetic Sciences, UCL, UK
Building Computational Models of Perception with Hierarchical Prediction Networks
Computational models of audio and visual perception are still unable to recreate the ability of the human perceptual system to represent the world in terms of mental objects. Where a machine finds a spectro-temporal pattern, we hear words; where a machine finds light and dark pixels, we see a three-dimensional room full of tables, chairs and people. Our perceptual system delivers an interpretation of sense data in terms of discrete objects in an automatic and seemingly effortless process. Recreating this ability in a computational model is still a long-term goal of machine perception.
In this talk I will first present some old ideas about the role of prediction in learning to perceive, then explain how a hierarchical model of prediction leads automatically to internal representations that are similar to mental objects. I will briefly discuss some supporting evidence from neuroscience for the role of prediction, and demonstrate a network trained on a hierarchical sequence prediction task.
29 October 2008
Department of Computer Science, University of Sheffield
Active listening in auditory scenes
Most Automatic Speech Recognition (ASR) technology fails in natural (i.e. noisy and unpredictable) listening environments. In contrast, human listening functions remarkably well in such environments where multiple sound sources are competing for the listener's attention. It has been proposed that this human ability is governed by Auditory Scene Analysis (ASA) processes, in which a sound mixture is segregated into perceptual streams by a combination of bottom-up and top-down processing. This talk investigates an approach, which takes inspiration from these human perception processes, to the problem of separating and recognising speech in the presence of multiple sound sources.
This approach operates by exploiting two levels of processing which combine to simultaneously separate and interpret sound sources. The first processing level employs signal-driven analysis motivated by models of auditory processing to identify spectral-temporal regions belonging to individual sources. Such a region is called a 'fragment'. In this work, tree-like structures in the auto-correlogram are exploited to track the pitch of simultaneous sound sources and the fragments are generated based on pitch continuity. Such analysis is largely common to all sounds and can be modelled without having to first identify the sound source. The second processing level uses statistical speech models to search simultaneously for the best sequence of speech models and the best subset of fragments to be identified with the speech models.
Evaluated on small-vocabulary speech recognition experiments, the proposed system produces word error rates significantly lower than those of conventional ASR systems over a range of signal-to-noise ratios. Finally, the talk will present some future directions towards developing a general computational framework for active listening in auditory scenes.
22 October 2008
Department of Information Studies, University of Sheffield
Quantifying speech disorder diagnosis
Speech disorder diagnostic procedures have traditionally been based on subjective assessment, a form of evaluation which can prove to be psychometrically weak and not reproducible. To reduce the likelihood of such perceptual inconsistency, a computerised system of objective acoustic speech measurement metrics is proposed, designed specifically for the Frenchay Dysarthria Assessment (FDA) series of diagnostic tests. These objective measures are based on deviance-from-norm template matching: various acoustic features, such as pitch contour, are extracted from the patient's oral response to test stimuli and the discrete values derived from these features are analysed to determine how much they vary from some pre-established norm. It is demonstrated that the pattern and magnitude of the observed variations indicate the type and severity of the dysarthric condition manifested.
Upon completion of the FDA assessment procedure, the resulting scores and medical notes describing the patient's performance for each individual test are processed by an expert system along with a pre-trained multi-layer perceptron (MLP); these two classifiers operate in conjunction to return an overall diagnosis of the type and severity of dysarthria which is apparently manifested. When tested on FDA assessment data from 85 individuals whose dysarthric conditions have been independently confirmed by expert clinicians, this hybrid system's diagnostic correctness is demonstrated to be 90.6% under certain conditions.
15 October 2008
Department of Computer Science, University of Sheffield
Perception of very brief segments of speech
A glance at a visual scene enables observers to become rapidly aware of its most important characteristics. In hearing too, there is evidence that listeners may obtain a rapid impression of the contents of an auditory scene. Here, we describe experiments using very brief segments of natural speech which demonstrate that a surprising amount of information can be determined from only a few milliseconds of the auditory signal. Segments were extracted from six vowels and six fricatives spoken by males and females, with duration ranging from 2.5 to 80 ms. Listeners identified the phoneme and/or gender, or whether a vowel or consonant had been presented. Although performance dropped close to chance for the shortest (2.5 ms) stimuli for most tasks, for the vowel/fricative distinction listeners obtained above 70% correct performance even for such short segments and, for three out of four tasks, performance was well above chance level for the 10 ms stimuli. Threshold values from logistic fits provide an indication of the order in which information becomes available to the listener: vowel/fricative distinction (3.0 ms), followed by voicing (6.7 ms), phoneme identification (11.9 ms) and gender identification (15.3 ms). Except for the shortest durations, listeners could distinguish vowels and fricatives equally well when presented with all 12 phoneme choices as when given a choice of only vowel or fricative.
In a second experiment, listeners identified the talker from segments with duration from 10 to 80 ms. Threshold values for talker identification were 13.5 ms for British listeners, and 19.4 ms for non-British listeners. An indirect measure of gender identification (determined from the talker selected by the listener) showed that the gender of the talker could be detected better by British listeners (threshold 11.8 ms) than non-British listeners (threshold 16.1 ms), suggesting that British listeners did not just use pitch when determining the talker. Overall, the results are consistent with the idea that listeners may be able to obtain the 'gist' of an auditory scene rapidly and use this to constrain further interpretation.
8 October 2008
Department of Computer Science, University of Sheffield
A speech fragment approach to localising multiple speakers in reverberant environments
We present a real-time, single-source acoustic localisation system running on the POP audio-visual robot-head (POPeye) and discuss the algorithmic limitations of the system when it is presented with more realistic acoustic environments, in which multiple acoustic sources are active in the presence of reverberation. Under these conditions sound source localisation cues are severely degraded and more sophisticated extraction techniques are needed. We present a system for localising simultaneous speakers which exploits the fact that in a speech mixture there exist spectro-temporal regions, or fragments, where the energy is dominated by just one of the speakers. A fragment-level localisation model is proposed that integrates the localisation cues within a fragment using a weighted mean. The weights are based on local estimates of the degree of reverberation in a given spectro-temporal cell. The paper investigates different weight estimation approaches based variously on: i) an established model of the perceptual precedence effect; ii) a measure of interaural coherence between the left and right ear signals; iii) a data-driven approach trained in matched acoustic conditions. Experiments with reverberant binaural data with two simultaneous speakers show that appropriate weighting can improve frame-based localisation performance by up to 24%.
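The fragment-level integration step can be sketched as a plain weighted mean. In this minimal illustration the function name is hypothetical, the cue is assumed to be a per-cell azimuth estimate in degrees, and the weights are assumed to be supplied by whichever reliability estimator is in use (precedence model, interaural coherence, or a trained mapping). Angles are averaged linearly, which ignores wrap-around.

```python
def fragment_azimuth(cues, weights):
    """Weighted mean of per-cell localisation cues within one fragment.

    cues    -- per-cell azimuth estimates (assumed to be in degrees)
    weights -- non-negative reliability weights, one per cell; a low
               weight marks a cell judged to be dominated by reverberation
    """
    if len(cues) != len(weights):
        raise ValueError("cues and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return sum(w * c for w, c in zip(weights, cues)) / total
```

For example, `fragment_azimuth([10.0, 12.0, 40.0], [1.0, 1.0, 0.0])` discards the third, unreliable cell and returns 11.0, whereas equal weights would pull the estimate towards the outlier.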
1 October 2008
Department of Computer Science, University of Sheffield
A gated recurrent self-organisation working memory model for emergent speech representation
With the many collections of speech and language corpora from business and government-funded projects, it is possible to demonstrate basic automatic speech recognition capabilities for a wide range of different spoken languages. However, existing systems are overly restrictive and are quite brittle, requiring their users to follow a very strict procedure to utilise spoken language applications. While there is agreement about the need for a novel approach to automatic speech recognition, there is no real consensus about the most promising direction. This talk suggests one such approach towards the ACORNS project's overall goal of a novel speech recognition approach that makes use of the growing body of knowledge about human cognitive processing through a gated attention recurrent working memory architecture.
The working memory architecture combines two elements. The first is an approach based on reinforcement learning that differentiates between speech and non-speech signals and acts as an attention mechanism, so that only speech is introduced into the working memory model. The second element provides a recurrent self-organising approach to the working memory representation of speech signals. The biological inspiration for the gated attention self-organising recurrent working memory architecture comes from various neuroscience systems, such as dopamine-based actor-critic reinforcement for attention, neurocognitive evidence on word representation in the brain, and the structure and organisation of working memory and the cerebral cortex.
9 April 2008
Alain de Cheveigné
Cancellation in auditory scene analysis.
Auditory scene analysis is essential for survival, and evolutionary pressure to be proficient at this task may have shaped in part the skills that we apply today to speech and music perception. One important sound segregation mechanism seems to be cancellation, a process by which the acoustic structure of an interfering source is tapped to suppress it so as to better hear a weak target. I will review behavioral data that support this hypothesis, and I'll discuss a possible neural mechanism to implement it. This mechanism accurately predicts complex details of the behavioral data. I will show how this relates to a more general model of low level auditory signal processing, the correlation network, that unifies and generalizes a wide range of models of time domain processing within the auditory nervous system. I will conclude by speculating on how this low-level model might be inserted within a more general model of auditory perception and cognition.
16 January 2008
Department of Computer Science, University of Sheffield, UK
Language Identification: Insights from the Classification of Hand Annotated Phone Transcripts
Language Identification (LID) of speech can be split into two processes: phone recognition and language modelling. This two-stage approach underlies some of the most successful LID systems. As phone recognizers become more accurate, it is useful to simulate a very accurate phone recognizer to determine the effect on the overall LID accuracy. This can be done by using phone transcripts. In this paper LID is performed on phone transcripts from six different languages in the OGI multi-language telephone speech corpus. By simulating a phone recognizer that classifies phones into ten broad classes, a simple n-gram model gives low LID equal error rates (EER) of <1% on 30 seconds of test data. Language models based on these accurate phone transcripts can reveal insights into the phonology of different languages.
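The language-modelling stage can be sketched with a small bigram classifier over broad-class phone strings. This is an illustrative assumption rather than the paper's configuration: it uses add-one smoothing, a toy alphabet of four classes (V, S, F, N) instead of ten, and made-up transcripts; all function names are hypothetical. Each language gets its own model, and a test transcript is assigned to the language whose model gives it the highest log-probability.

```python
import math
from collections import Counter


def train_bigram(transcripts, alphabet):
    """Return a function giving add-one-smoothed bigram log-probabilities."""
    pair_counts = Counter()
    context_counts = Counter()
    for seq in transcripts:
        padded = ["<s>"] + list(seq)  # "<s>" marks the start of a transcript
        for prev, cur in zip(padded, padded[1:]):
            pair_counts[(prev, cur)] += 1
            context_counts[prev] += 1
    vocab_size = len(alphabet)

    def log_prob(prev, cur):
        return math.log((pair_counts[(prev, cur)] + 1)
                        / (context_counts[prev] + vocab_size))

    return log_prob


def score(log_prob, seq):
    """Total log-probability of a transcript under one language model."""
    padded = ["<s>"] + list(seq)
    return sum(log_prob(prev, cur) for prev, cur in zip(padded, padded[1:]))


def identify(models, seq):
    """Pick the language whose model scores the transcript highest."""
    return max(models, key=lambda lang: score(models[lang], seq))


# Toy data: V = vowel, S = stop, F = fricative, N = nasal (hypothetical).
ALPHABET = "VSFN"
models = {
    "A": train_bigram(["SVSVNVSV", "FVSVNV"], ALPHABET),  # mostly CV syllables
    "B": train_bigram(["SFVSSN", "FSVNSF"], ALPHABET),    # consonant clusters
}
```

With these toy models, `identify(models, "SVNVSV")` selects language A and `identify(models, "SSFVNS")` selects language B, because the two languages differ in how often consonant classes follow one another.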