10 December 2008
Michael Mandel
LabROSA, Columbia University, New York, USA
Model-based EM Source Separation and Localization in Reverberant Mixtures
This talk will describe our system for separating and localizing multiple sound
sources from a reverberant two-channel recording. The talk begins with a
characterization of the interaural spectrogram for single source recordings, and a
method for constructing a probabilistic model of interaural parameters that can be
evaluated at individual spectrogram points. Multiple models can then be combined
into a mixture model of sources and delays, which reduces the multi-source
localization problem to a collection of single source problems. The talk will
then outline an expectation maximization algorithm for finding the
maximum-likelihood parameters of this mixture model, and show that these
parameters correspond well with interaural parameters measured in isolation. As a
byproduct of fitting this model, the algorithm creates probabilistic spectrogram
masks that can be used for source separation. In experiments performed in
simulated anechoic and reverberant environments, our system on average improved
the direct-path signal-to-noise ratio of target sources by 2.1 dB more than two
comparable algorithms.
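As a rough illustration of the EM procedure sketched above, the following numpy fragment fits a mixture of Gaussians to per-point interaural phase differences and returns the per-source posteriors as probabilistic spectrogram masks. It is a minimal sketch, not Mandel's implementation: the candidate delays are assumed to be fixed in advance and the observation model is a single Gaussian on the wrapped phase residual.

    import numpy as np

    def em_ipd_mixture(ipd, delays, freqs, n_iter=20, var0=1.0):
        """ipd: (F, T) observed interaural phase differences (radians).
        delays: candidate interaural delay per source (seconds).
        freqs: (F,) centre frequency of each spectrogram row (Hz)."""
        F, T = ipd.shape
        K = len(delays)
        pi = np.full(K, 1.0 / K)      # mixing weights
        var = np.full(K, var0)        # per-source phase-residual variance
        for _ in range(n_iter):
            # E-step: wrapped phase residual after removing each candidate delay,
            # scored under a zero-mean Gaussian.
            resid = ipd[None] - 2 * np.pi * freqs[None, :, None] * np.asarray(delays)[:, None, None]
            resid = np.angle(np.exp(1j * resid))
            loglik = -0.5 * resid ** 2 / var[:, None, None] - 0.5 * np.log(2 * np.pi * var)[:, None, None]
            logpost = np.log(pi)[:, None, None] + loglik
            logpost -= logpost.max(axis=0, keepdims=True)
            post = np.exp(logpost)
            post /= post.sum(axis=0, keepdims=True)       # (K, F, T) soft masks
            # M-step: re-estimate weights and variances from the soft masks.
            n_k = post.sum(axis=(1, 2))
            pi = n_k / n_k.sum()
            var = (post * resid ** 2).sum(axis=(1, 2)) / n_k
        return post, pi, var          # post[k] is the probabilistic mask for source k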
3 December 2008
Robert P. Morse
School of Life and Health Sciences, Aston University
Stochastic neural coding and implications for cochlear implant design
Randomness in a system can have many benefits and researchers realized in the
early 1990s that Nature may have been taking advantage of these benefits for
millions of years - an area of research associated with stochastic resonance and
the periodicity of the ice ages. Sensory systems such as the hair cells in the
ear are notoriously noisy and the electrical signals resulting from the sound
signals are swamped by random fluctuations, caused for example by the Brownian
motion of the stereocilia. Auditory scientists have often considered this noise a
hindrance, but the results from stochastic resonance suggest that it might
actually be an essential part of normal coding. The hair cells, however, are
physiologically very vulnerable and in many forms of profound deafness they have
been completely destroyed; so, the sources of noise present in normal hearing are
absent in the deafened ear. We have therefore advocated that cochlear implants
may be improved by deliberately adding noise to the signals that go to the
electrodes. In computational studies we have shown that the addition of noise to
cochlear implant signals results in nerve activity that more closely resembles that
evoked by acoustic stimulation; moreover, the amount of information transmitted by
the cochlear nerve can be increased.
In this talk I will review the reasons why noise may be beneficial for cochlear
implants and the technical problems that must be overcome to make the method
practical. I will also show that another form of noise present in normal hearing
can theoretically lead to the cochlear nerve transmitting even more information
than noise that simply adds to the signal, as in standard stochastic resonance.
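As a back-of-the-envelope illustration of the stochastic resonance effect described above, the sketch below passes a sub-threshold tone through a hard threshold detector: with no noise nothing is detected, with moderate noise the threshold crossings cluster around the signal peaks, and with heavy noise the crossings become uninformative. The detector and parameter values are illustrative only, not a model of the cochlear nerve or of any implant coding strategy.

    import numpy as np

    rng = np.random.default_rng(0)
    fs = 10_000                                   # sample rate (Hz)
    t = np.arange(0, 1.0, 1 / fs)
    signal = 0.8 * np.sin(2 * np.pi * 100 * t)    # sub-threshold 100 Hz tone
    threshold = 1.0

    def crossing_rate(x, thr):
        """Count upward crossings of the threshold."""
        above = x >= thr
        return np.count_nonzero(above[1:] & ~above[:-1])

    for noise_sd in (0.0, 0.1, 0.3, 1.0, 3.0):
        x = signal + rng.normal(0.0, noise_sd, size=t.shape)
        # Correlate the crossing pattern with the signal as a crude proxy for
        # how much signal information the crossings carry.
        above = (x >= threshold).astype(float)
        coherence = np.corrcoef(above, signal)[0, 1] if above.std() > 0 else 0.0
        print(f"noise sd {noise_sd:4.1f}: {crossing_rate(x, threshold):5d} crossings, "
              f"crossing/signal correlation {coherence:+.3f}")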
12 November 2008
Roddy Cowie
Queen's University Belfast, UK
The road to conversation with a computer
Conversation is not simply a matter of taking turns to transmit messages. The
explicit messages that people exchange are embedded in interchanges at
multiple other levels. That is well known in principle, but the details
remain frustratingly incomplete in various ways. We became
interested in forming an inclusive picture during the 1980s, through work
on sociolinguistics, and have returned to it recently in a project that
aims to achieve sustained conversation between a human and a computer,
called SEMAINE. We hope to carry it forward in a new project on social
signal processing beginning next year.
It would be useful if new technological efforts could access a
description of conversation that was rich enough to pre-empt unhelpful
simplifications.
Conversation can be thought of as transmitting several kinds of
information. Explicit messages are the most obvious element. Linguistics
has stressed that control information (about continuing, yielding the
floor, and so on) is also needed. Sociolinguistics stressed that
information about enduring personal qualities (background, affiliations,
etc) is also routinely transmitted. Other types of enduring personal
quality, such as personality, are also signalled (not necessarily
reliably). We have been part of a recent surge in interest in shorter term
personal conditions, most obviously emotion, but also stances (friendly,
impatient), epistemic states (interested, sure), and what might be called
concerns (the things that matter to the person in the current
situation).
The transmission will typically be in a context where certain ends are
to be achieved. Pure interchange of information may be the goal, but others
are very common. Social goals such as defining status, getting acquainted,
and forming bonds are pervasive. Practical goals include managing joint
action and persuading. Limiting exchange, by concealing, deceiving or escaping,
is extremely important. Conversation may also be a means of entertainment,
or simply passing time.
The medium of transmission is a separate issue (that is often concealed
by the term nonverbal communication). Most types of information can be
transmitted via multiple media - words, mode of speech, facial signals, and
gestures. The transmission may be at various levels - item selection (what
you say or do), item execution (how you say or do it), and sequence or
interspersion (in what context you say or do it). The execution of almost
any action can become communicative - walking, standing, or knocking at a
door.
Reading these signals is also a multi-level problem. There is a machine
perception task of detecting significant classes of event. Sometimes there
are univocal translation rules from a class of event to a meaning. But
decoding is often more akin to abduction - seeking out a satisfactory
explanation of the signs.
Some of these issues have been studied within linguistic paradigms.
Others are difficult to address without large databases and sophisticated
signal processing. Connecting the paradigms is an interesting challenge in
itself.
Clearly human-computer interaction will use simplified models of
conversation for the foreseeable future. But it is intellectually more
satisfying to have an understanding of the simplifications that are being
made, and there are situations where it may be practically important.
5 November 2008
Mark Huckvale
Speech, Hearing and Phonetic Sciences, UCL, UK
Building Computational Models of Perception with Hierarchical Prediction Networks
Computational models of audio and visual perception are still unable to
recreate the ability of the human perceptual system to represent the
world in terms of mental objects. Where a machine finds a
spectro-temporal pattern, we hear words; where a machine finds light and
dark pixels, we see a three-dimensional room full of tables, chairs and
people. Our perceptual system delivers an interpretation of sense data
in terms of discrete objects in an automatic and seemingly effortless
process. Recreating this ability in a computational model is still a
long-term goal of machine perception.
In this talk I will first present some old ideas about the role of
prediction in learning to perceive, then explain how a hierarchical
model of prediction leads automatically to internal representations that
are similar to mental objects. I will briefly discuss some supporting
evidence from neuroscience for the role of prediction, and demonstrate a
network trained on a hierarchical sequence prediction task.
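As a rough, self-contained illustration of learning to perceive through prediction, the sketch below trains a toy Elman-style recurrent network on next-symbol prediction over a stream built from a small artificial lexicon; units driven purely by the prediction task come to carry information about the larger structures (the "words") in the stream. The lexicon, network sizes and truncated training rule are illustrative choices, not the network demonstrated in the talk.

    import numpy as np

    rng = np.random.default_rng(1)
    words = ["ba", "dii", "guuu"]                 # toy lexicon
    symbols = sorted(set("".join(words)))
    idx = {s: i for i, s in enumerate(symbols)}
    stream = "".join(rng.choice(words, size=400))
    data = np.array([idx[c] for c in stream])

    V, H = len(symbols), 12
    Wxh = rng.normal(0, 0.1, (H, V))              # input -> hidden
    Whh = rng.normal(0, 0.1, (H, H))              # hidden -> hidden (recurrence)
    Why = rng.normal(0, 0.1, (V, H))              # hidden -> next-symbol prediction
    lr = 0.05

    for epoch in range(5):
        h_prev, loss = np.zeros(H), 0.0
        for t in range(len(data) - 1):
            x = np.zeros(V)
            x[data[t]] = 1.0
            h = np.tanh(Wxh @ x + Whh @ h_prev)
            logits = Why @ h
            p = np.exp(logits - logits.max())
            p /= p.sum()
            loss -= np.log(p[data[t + 1]])
            # Truncated gradient step (no backprop through time): enough to
            # show the prediction error falling as structure is learned.
            dlogits = p.copy()
            dlogits[data[t + 1]] -= 1.0
            dh = (Why.T @ dlogits) * (1 - h ** 2)
            Why -= lr * np.outer(dlogits, h)
            Wxh -= lr * np.outer(dh, x)
            Whh -= lr * np.outer(dh, h_prev)
            h_prev = h
        print(f"epoch {epoch}: mean cross-entropy {loss / (len(data) - 1):.3f}")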
29 October 2008
Ning Ma
Department of Computer Science, University of Sheffield
Active listening in auditory scenes
Most Automatic Speech Recognition (ASR) technology fails in natural (i.e. noisy
and unpredictable) listening environments. In contrast, human listening functions
remarkably well in such environments where multiple sound sources are competing for
the listener's attention. It has been proposed that this human ability is governed
by Auditory Scene Analysis (ASA) processes, in which a sound mixture is segregated
into perceptual streams by a combination of bottom-up and top-down processing. This
talk investigates an approach, inspired by these human perception
processes, to the problem of separating and recognising speech in the presence of
multiple sound sources.
This approach operates by exploiting two levels of processing which combine to
simultaneously separate and interpret sound sources. The first processing level
employs signal-driven analysis motivated by models of auditory processing to
identify spectral-temporal regions belonging to individual sources. Such a region
is called a `fragment'. In this work, tree-like structures in the auto-correlogram
are exploited to track the pitch of simultaneous sound sources and the fragments
are generated based on pitch continuity. Such analysis is largely common to all
sounds and can be modelled without having to first identify the sound source. The
second processing level uses statistical speech models to simultaneously search
for the best sequence of speech models and the best subset of fragments to be
identified with the speech models.
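The joint search over model sequences and fragment subsets is the heart of the approach; the toy sketch below illustrates just the fragment-subset part, exhaustively scoring every subset of fragments as foreground under a fixed "speech" Gaussian and the remainder under a "background" Gaussian. The real system decodes against statistical word models rather than a single Gaussian, so this is an illustrative simplification only.

    import itertools
    import numpy as np

    def log_gauss(x, mean, var):
        return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

    def best_fragment_subset(spec, fragments, speech=(0.0, 1.0), background=(-3.0, 1.0)):
        """spec: (F, T) log-energy 'spectrogram'; fragments: list of boolean (F, T) masks."""
        best_score, best_subset = -np.inf, ()
        for r in range(len(fragments) + 1):
            for subset in itertools.combinations(range(len(fragments)), r):
                fg = np.zeros(spec.shape, dtype=bool)
                for i in subset:
                    fg |= fragments[i]
                score = (log_gauss(spec[fg], *speech).sum()
                         + log_gauss(spec[~fg], *background).sum())
                if score > best_score:
                    best_score, best_subset = score, subset
        return best_subset, best_score

    # Toy usage: one speech-like (loud) fragment and one background-like fragment.
    rng = np.random.default_rng(0)
    spec = rng.normal(-3.0, 1.0, (8, 10))
    frag_a = np.zeros((8, 10), dtype=bool); frag_a[2:5, 3:7] = True
    frag_b = np.zeros((8, 10), dtype=bool); frag_b[6:8, 0:4] = True
    spec[frag_a] = rng.normal(0.0, 1.0, int(frag_a.sum()))
    print(best_fragment_subset(spec, [frag_a, frag_b]))   # should select only fragment 0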
Evaluated on small-vocabulary speech recognition experiments, the proposed
system produces word error rates significantly lower than conventional ASR systems
over a range of signal-to-noise ratios. Finally, the talk will present some
future directions towards developing a general computational framework for active
listening in auditory scenes.
PPT slides
Autocorrelogram movies: clean
Autocorrelogram movies: slow motion
Autocorrelogram movies: two talker
22 October 2008
James Carmichael
Department of Information Studies, University of Sheffield
Quantifying speech disorder diagnosis
Speech disorder diagnostic procedures have traditionally been based on subjective
assessment, a form of evaluation which can prove to be psychometrically weak and
not reproducible. To reduce the likelihood of such perceptual inconsistency, a
computerised system of objective acoustic speech measurement metrics is proposed,
designed specifically for the Frenchay Dysarthria Assessment (FDA) series of diagnostic
tests. These objective measures are based on deviance-from-norm template matching:
various acoustic features, such as pitch contour, are extracted from the patient's oral
response to test stimuli and the discrete values derived from these features are analysed
to determine how much they vary from some pre-established norm. It is demonstrated that
the pattern and magnitude of the observed variations indicate the type and severity of
the dysarthric condition manifested.
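By way of illustration, the sketch below shows the general shape of such deviance-from-norm scoring for a single feature (a pitch contour): summary measures are converted to z-scores against population norms and the mean deviation is mapped to a severity band. The norm values and band boundaries are placeholders, not the FDA's calibrated figures.

    import numpy as np

    # Hypothetical norms: (mean, standard deviation) for each summary measure.
    NORMS = {
        "f0_mean_hz": (120.0, 20.0),
        "f0_range_hz": (80.0, 25.0),
        "voiced_fraction": (0.65, 0.10),
    }

    def severity_from_pitch(f0_contour_hz):
        """f0_contour_hz: per-frame F0 estimates, with 0 marking unvoiced frames."""
        voiced = f0_contour_hz[f0_contour_hz > 0]
        measures = {
            "f0_mean_hz": voiced.mean(),
            "f0_range_hz": voiced.max() - voiced.min(),
            "voiced_fraction": voiced.size / f0_contour_hz.size,
        }
        z = {k: abs(v - NORMS[k][0]) / NORMS[k][1] for k, v in measures.items()}
        deviance = np.mean(list(z.values()))
        band = ("normal" if deviance < 1.0 else
                "mild" if deviance < 2.0 else
                "moderate" if deviance < 3.0 else "severe")
        return measures, z, band

    # Toy usage: a monotone, low-pitched contour with 40% unvoiced frames.
    rng = np.random.default_rng(0)
    f0 = np.concatenate([np.zeros(40), np.full(60, 95.0) + rng.normal(0, 2, 60)])
    print(severity_from_pitch(f0))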
Upon completion of the FDA assessment procedure, the resulting scores and medical
notes describing the patient's performance for each individual test are processed by an
expert system along with a pre-trained multi-layer perceptron (MLP); these
two classifiers operate in conjunction to return an overall diagnosis of the type and
severity of dysarthria which is apparently manifested. When tested on FDA assessment data
from 85 individuals whose dysarthric conditions have been independently confirmed by
expert clinicians, the hybrid system achieves a diagnostic accuracy of 90.6% under
certain conditions.
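The fusion of the two classifiers might look something like the sketch below, in which a small set of hand-written rules over FDA subtest scores and the posterior of a pre-trained MLP are averaged to give a combined diagnosis. The rules, class labels, features and fusion weighting are invented for illustration and do not reflect the actual system.

    import numpy as np

    CLASSES = ["flaccid", "spastic", "ataxic", "mixed"]   # hypothetical labels

    def expert_rules(fda_scores):
        """fda_scores: dict of subtest name -> score (low = impaired); rules are invented."""
        votes = np.zeros(len(CLASSES))
        if fda_scores["lips"] < 3 and fda_scores["palate"] < 3:
            votes[CLASSES.index("flaccid")] += 1.0
        if fda_scores["laryngeal"] < 3 and fda_scores["tongue"] < 4:
            votes[CLASSES.index("spastic")] += 1.0
        if fda_scores["intelligibility"] < 4 and votes.sum() == 0:
            votes[CLASSES.index("ataxic")] += 1.0
        return votes / votes.sum() if votes.sum() else np.full(len(CLASSES), 0.25)

    def mlp_posterior(features, W1, b1, W2, b2):
        """Forward pass of a single-hidden-layer MLP; the weights stand in for a trained net."""
        h = np.tanh(features @ W1 + b1)
        logits = h @ W2 + b2
        p = np.exp(logits - logits.max())
        return p / p.sum()

    def hybrid_diagnosis(fda_scores, features, mlp_weights, rule_weight=0.5):
        fused = (rule_weight * expert_rules(fda_scores)
                 + (1 - rule_weight) * mlp_posterior(features, *mlp_weights))
        return CLASSES[int(np.argmax(fused))], fused

    # Toy usage with random stand-in MLP weights and made-up subtest scores.
    rng = np.random.default_rng(0)
    weights = (rng.normal(0, 0.3, (6, 8)), np.zeros(8), rng.normal(0, 0.3, (8, 4)), np.zeros(4))
    scores = {"lips": 2, "palate": 2, "laryngeal": 6, "tongue": 5, "intelligibility": 5}
    print(hybrid_diagnosis(scores, rng.normal(size=6), weights))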
PPT slides
Sample sound: Phonation
Sample sound: Respiration at rest
James Carmichael's PhD thesis
A glance at a visual scene enables observers to become rapidly aware of
its most important characteristics. In hearing too, there is evidence
that listeners may obtain a rapid impression of the contents of an
auditory scene. Here, we describe experiments using very brief segments
of natural speech which demonstrate that a surprising amount of
information can be determined from only a few milliseconds of the
auditory signal. Segments were extracted from six vowels and six
fricatives spoken by males and females, with duration ranging from 2.5
to 80 ms. Listeners identified the phoneme and/or gender, or whether a
vowel or consonant had been presented. Although performance dropped
close to chance for the shortest (2.5 ms) stimuli for most tasks, for
the vowel/fricative distinction listeners obtained above 70% correct
performance even for such short segments and, for three out of four
tasks, performance was well above chance level for the 10 ms stimuli.
Threshold values from logistic fits provide an indication of the order
in which information becomes available to the listener: vowel/fricative
distinction (3.0 ms), followed by voicing (6.7 ms), phoneme
identification (11.9 ms) and gender identification (15.3 ms). Except for
the shortest durations, listeners could distinguish vowels and
fricatives equally well when presented with all 12 phoneme choices as
when given a choice of only vowel or fricative.
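The thresholds quoted above come from logistic fits to the psychometric data; the sketch below shows the general procedure on made-up proportion-correct values, fitting a logistic function of log-duration and inverting it at a 75%-correct criterion. The data points and criterion are illustrative assumptions, and the study's exact fitting procedure may differ.

    import numpy as np
    from scipy.optimize import curve_fit

    def psychometric(log_dur, midpoint, slope, chance=0.5):
        """Logistic rising from chance level to 1.0 as a function of log-duration."""
        return chance + (1.0 - chance) / (1.0 + np.exp(-slope * (log_dur - midpoint)))

    # Hypothetical proportion-correct values for segment durations in ms.
    durations_ms = np.array([2.5, 5, 10, 20, 40, 80])
    p_correct = np.array([0.52, 0.61, 0.78, 0.90, 0.96, 0.98])

    params, _ = curve_fit(psychometric, np.log(durations_ms), p_correct, p0=[np.log(10), 2.0])
    midpoint, slope = params

    # Invert the fitted logistic at the criterion to read off the threshold duration.
    criterion = 0.75
    log_thresh = midpoint - np.log((1.0 - 0.5) / (criterion - 0.5) - 1.0) / slope
    print(f"estimated threshold: {np.exp(log_thresh):.1f} ms")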
In a second experiment, listeners identified the talker from segments
with duration from 10 to 80 ms. Threshold values for talker
identification were 13.5 ms for British listeners, and 19.4 ms for
non-British listeners. An indirect measure of gender identification
(determined from the talker selected by the listener) showed that the
gender of the talker could be detected better by British listeners
(threshold 11.8 ms) than non-British listeners (threshold 16.1 ms),
suggesting that British listeners did not just use pitch when
determining the talker. Overall, the results are consistent with the
idea that listeners may be able to obtain the 'gist' of an auditory
scene rapidly and use this to constrain further interpretation.
1 October 2008
Mark Elshaw
Department of Computer Science, University of Sheffield
A gated recurrent self-organising working memory model for emergent speech representation
With the many collections of speech and language corpora produced by business- and
government-funded projects, it is now possible to demonstrate basic automatic
speech recognition capabilities for a wide range of spoken languages.
However, existing systems are overly restrictive and are quite brittle,
requiring their users to follow a very strict procedure to utilise spoken
language applications. While there is agreement about the need for a novel
approach to automatic speech recognition, there is no real consensus about the
most promising direction. This talk suggests one such approach towards the
ACORNS project's overall goal of a speech recognition system that makes use of the
growing body of knowledge about human cognitive processing: a gated attention
recurrent working memory architecture.
The working memory architecture combines two elements. The first uses
reinforcement learning to differentiate between speech and non-speech signals,
acting as an attention mechanism so that only speech is introduced into the
working memory model. The second provides a recurrent self-organising approach
to the working memory representation of speech signals.
The biological inspiration for the gated attention self-organising recurrent
working memory architecture comes from several strands of neuroscience:
dopamine-based actor-critic reinforcement learning for attention, neurocognitive
evidence on word representation in the brain, and the structure and organisation
of working memory and the cerebral cortex.
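As a rough illustration of a recurrent self-organising approach to sequence representation, the sketch below implements a small recurrent SOM in which each unit matches both the current input frame and a context vector describing the previous winner, so the map's response depends on sequence history as well as the current frame. The specific recurrent-SOM variant and all parameter values are illustrative assumptions rather than the ACORNS architecture.

    import numpy as np

    rng = np.random.default_rng(0)

    class RecurrentSOM:
        def __init__(self, n_units, dim, alpha=0.5, lr=0.1, sigma=2.0):
            self.w = rng.normal(0, 0.1, (n_units, dim))      # input weights
            self.c = rng.normal(0, 0.1, (n_units, n_units))  # context weights
            self.alpha, self.lr, self.sigma = alpha, lr, sigma
            self.context = np.zeros(n_units)                 # previous winner code

        def step(self, x):
            # Distance combines match to the input and match to the context.
            d = ((self.w - x) ** 2).sum(1) + self.alpha * ((self.c - self.context) ** 2).sum(1)
            winner = int(np.argmin(d))
            # Gaussian neighbourhood on the (1-D) map index.
            h = np.exp(-0.5 * ((np.arange(len(d)) - winner) / self.sigma) ** 2)
            self.w += self.lr * h[:, None] * (x - self.w)
            self.c += self.lr * h[:, None] * (self.context - self.c)
            # New context: one-hot code of the winner (a common simplification).
            self.context = np.eye(len(d))[winner]
            return winner

    # Toy usage: a repeating two-"word" feature sequence.
    som = RecurrentSOM(n_units=20, dim=3)
    frames = np.tile(np.vstack([np.eye(3), np.eye(3)[::-1]]), (50, 1))
    winners = [som.step(f) for f in frames]
    print(winners[-12:])   # identical frames in different contexts may map to different units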
PPT slides