7 December 2005
Ray Meddis
Hearing Research Laboratory, Department of Psychology University of Essex, UK
A computer model of absolute threshold performance in human listeners
Behavioral absolute threshold for a pure tone stimulus in quiet cannot
easily be defined in physiological terms. Auditory nerve (AN) rate
threshold is an inappropriate comparison because psychophysical
thresholds depend on the duration of the stimuli used. Traditional
temporal integration theories explain many psychophysical observations
but it remains unclear where the integration takes place. This
modeling project starts from the observation that auditory nerve
first-spike latency is reduced as the stimulus is increased in both
duration and level. This could be related to the dependence of
behavioral threshold on duration because more intense stimuli can be
detected at shorter durations. Unfortunately, near threshold, there
is no simple way to distinguish the first spike after stimulus onset
from spontaneous activity. However, if first-order relay neurons are
set to respond only to coincidental AN action potentials across a
number of fibers, they can be shown to have thresholds that depend
appropriately on stimulus duration. A computer model of the auditory
periphery is described that has realistic first-spike latency
properties. It also has firing thresholds that decline with increased
stimulus duration in a manner similar to psychophysical
observations. This is a temporal integration model where the
integration takes place using pre-synaptic calcium in inner hair
cells.
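As a rough illustration of the coincidence idea above (not the model described in the talk), the sketch below simulates a relay "neuron" that fires only when several of a group of Poisson auditory-nerve fibres spike within the same short window; its detection rate rises with stimulus duration, while spontaneous activity alone rarely triggers it. All rates, fibre counts, window sizes and thresholds are invented for illustration.

```python
# Illustrative sketch only, not the model described in the talk: a relay
# "neuron" that responds when at least k of N auditory-nerve fibres spike
# within the same 1-ms window. All parameter values are invented.
import numpy as np

rng = np.random.default_rng(0)

def detected(duration_s, driven_rate, n_fibres=10, spont_rate=50.0,
             window_s=0.001, k_coincident=6, dt=0.0001):
    """True if >= k_coincident fibres spike within some 1-ms bin of the
    stimulus; each fibre is an independent Poisson process."""
    n_steps = int(duration_s / dt)
    p_spike = 1.0 - np.exp(-(spont_rate + driven_rate) * dt)
    spikes = rng.random((n_fibres, n_steps)) < p_spike        # fibre x time
    win = int(window_s / dt)
    n_windows = n_steps // win
    binned = spikes[:, :n_windows * win].reshape(n_fibres, n_windows, win)
    fibres_per_window = binned.any(axis=2).sum(axis=0)        # per 1-ms bin
    return bool((fibres_per_window >= k_coincident).any())

# Spontaneous activity alone almost never triggers the detector, but a
# near-threshold driven rate triggers it more often as the stimulus
# lengthens, mimicking the decline of threshold with duration.
for dur in (0.02, 0.1, 0.5):
    hits = sum(detected(dur, driven_rate=150.0) for _ in range(200))
    print(f"{dur * 1000:5.0f} ms stimulus: detection rate {hits / 200:.2f}")
```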
30 November 2005
Christopher Newell
Scarborough School of Arts, University of Hull, UK
Do we need synthetic speech to sound natural, in order to sound expressive?
One goal of synthetic speech research is to produce human-like
speech indistinguishable from the real thing. To do this, the
voice must be individual and expressive. In this
research we argue that achieving human-like speech is not the only way
to produce an individual and expressive synthetic voice. Any voice,
including a machine voice, is interpreted by users within a
context. Framing a machine-like disembodied voice within the context
of live theatrical or musical performance circumvents problems of
naturalness and allows the voice to develop unconstrained expressive
possibilities similar to singing. This may operate as a legitimate
substitute for the synthesis of human vocal expressiveness.
Power point slides
Associated sound clips:
23 November 2005
Yan-Chen Lu
Department of Computer Science, University of Sheffield, UK
My Multimedia Innovation Journey
This talk will present my recent work in the field of multimedia
communications to create an IP surveillance system on a System on a
Chip (SoC) platform. The presentation will describe work ranging from
algorithm design to integration onto physical circuits.
A digital signal processing algorithm can be implemented as
software on a PC, as middleware on an embedded platform and as an
integrated circuit. The chosen implementation is determined by cost
and specification. I will show how these realizations differ by
introducing my multimedia coding implementations. I will also discuss
the principles of popular auditory and visual compression techniques
to derive the corresponding design considerations. PCs are powerful
computing platforms that enable fast, fully-functioning
verification. However, they are costly in both price and power
consumption compared to digital signal processors (DSPs). In an
optimized port, the data path and control flow must be organised
around the DSP architecture. Application-specific integrated circuits
allow the data path to be optimized with little effort spent on
data-flow control, at the cost of the flexibility of software. An SoC
platform exploits IP reuse and integrates the whole system on a single
chip, so a complex system can be built within a reasonable design
time, which is crucial for a commercial product.
Power point slides
17 November 2005
Alan Newell
Research in Applied Computing, University of Dundee, UK
Systems for Older and Disabled People
This seminar will describe research into developing computer
systems to support older and disabled people.
Research Projects currently include:
- Innovative requirements gathering techniques with older and disabled
people
- Accessibility and usability of IT systems for a wide range of users
- Specially designed email and web browsers
- Reminiscence and conversational support for people with dementia
- Applications of digital television for older and disabled people.
- Communication systems for non-speaking people
- Systems to support people with dyslexia
- Gesture and fall recognition, and lifestyle modeling
- The use of theatre in human computer interaction research and
usability studies.
The research follows the philosophy of "ordinary and extra-ordinary
human-machine interaction", which is based on the parallels between a
non-disabled (ordinary) person performing an extra-ordinary
(high-workload) task and a disabled (extra-ordinary) person performing
an ordinary task, and on environments which "disable" ordinary people.
This work has led to the new concepts of "User Sensitive Inclusive
Design" and "Design for Dynamic Diversity".
16 November 2005
Sue Denham
Centre for Theoretical and Computational Neuroscience, University of Plymouth, UK
Modelling the representation and classification of natural sounds
Our recent work has been directed towards developing a
representation of acoustic stimuli suitable for real time
classification of natural sounds using networks of spiking
neurons. The challenge is to find a representation which encapsulates
time varying spectrotemporal patterns in patterns of network activity
which can be read out and classified at the slower timescales typical
for cortical responses. The model we propose is based upon
biologically plausible processing including cochlear filtering, the
extraction of transients, and convolution with cortex-like
spectrotemporal response fields (STRFs). Derivation of useful STRFs is
achieved in a putative developmental stage through exposure to speech,
and the properties of the resulting response fields show a surprising
similarity to those measured experimentally. Salient events are
evident in the response of the ensemble of STRFs. The resulting
representation is capable of supporting multiple interpretations of
the input, as is characteristic of human perception, e.g. awareness
of both the words and the speaker's identity.
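As a rough illustration of the STRF-filtering stage described above, the sketch below convolves a toy cochleagram with a hand-made Gabor-like spectrotemporal patch. This is not one of the response fields learned in the developmental stage of the model; all sizes and tuning values are assumptions.

```python
# Minimal sketch of STRF filtering: convolve a time-frequency
# representation with a spectrotemporal response field. The "STRF" here
# is a hand-made upward-sweep Gabor patch for illustration only.
import numpy as np
from scipy.signal import convolve2d

n_channels, n_frames = 64, 200            # toy cochleagram: channels x frames
rng = np.random.default_rng(1)
cochleagram = rng.random((n_channels, n_frames))

# Gabor-like STRF tuned to an upward frequency sweep (assumed shape/size).
f = np.linspace(-1, 1, 15)[:, None]       # frequency axis of the patch
t = np.linspace(-1, 1, 25)[None, :]       # time axis of the patch
strf = np.exp(-(f**2 + t**2) / 0.4) * np.cos(6.0 * (f - t))

response = convolve2d(cochleagram, strf, mode="same")   # STRF "neuron" output
salience = np.maximum(response, 0).sum(axis=0)          # per-frame activity
print(salience.shape)                                   # (200,)
```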
Power point slides
Representative papers covering topics in the talk:
2 November 2005
Piers Messum
University College London, UK
Learning to talk, but not by imitation
What moves a child's pronunciation in the direction of the adult
norm? The standard answer is that children imitate the speech models
that surround them and that they get better at this with age and
experience. Such imitation must take different forms, of course. So
for temporal phenomena (like 'rhythm' or changes in vowel lengths) the
child must abstract 'rules' which determine these effects in different
contexts, while for speech sounds he abstracts the features which make
sounds distinctive and then produces these with his own voice.
The assumption of imitation underlies much research on speech, but
it is problematic. I will present an alternative account of
phonological development, where imitation plays only a minimal role.
In its place, canalising pressures and reinforcement learning
mechanisms are sufficient to account for progress in pronunciation.
The canalising pressures arise from the embodiment of speech: the fact
that the production apparatus is a child's body rather than an adult's
is the most significant factor.
Power point slides
19 October 2005
Esmeralda Uraga
Department of Computer Science, University of Sheffield, UK
Experiments using acoustic and articulatory data for speech recognition
In conventional acoustic modeling approaches for speech
recognition, acoustic features are mapped to discrete symbols, such as
phonemes or other subword units. Acoustic representations of speech
are obtained from feature extraction techniques that make little or no
use of knowledge about the underlying speech production mechanism. It
has been suggested that using articulatory representations of speech
should allow for better recognition systems. However, previous
attempts to improve the performance of speech recognition systems
using direct information about the movement of articulators have been
unsuccessful.
This study compares the performance of several acoustic and
articulatory speech recognition systems evaluated on a multi-channel
acoustic-articulatory corpus (MOCHA). The results show that speech
recognition systems based on articulatory representations of speech
outperform MFCC-based systems in a speaker-dependent phone recognition
task. We have found that articulatory representations of speech
provide a comprehensive description which is not only comparable to
that derived from the acoustic signal but also contains information
that may not be present in a standard mel-cepstrum-based representation.
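The comparison described above can be caricatured as training the same simple classifier on acoustic and on articulatory frame features. In the sketch below, random placeholder data stands in for MOCHA features and phone labels, and a frame-level classifier stands in for the full recognisers used in the study.

```python
# Hedged sketch: the same classifier trained on acoustic (MFCC-like) and
# articulatory (EMA-like) frame features. Real MOCHA features and phone
# labels would replace the random placeholders below.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
n_frames, n_phones = 2000, 10
labels = rng.integers(0, n_phones, n_frames)

# Placeholders: 39-dim MFCC+deltas and, say, 14-dim EMA coil coordinates.
mfcc = rng.normal(size=(n_frames, 39)) + labels[:, None] * 0.05
ema  = rng.normal(size=(n_frames, 14)) + labels[:, None] * 0.05

for name, feats in (("MFCC", mfcc), ("articulatory", ema)):
    X_tr, X_te, y_tr, y_te = train_test_split(feats, labels, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(name, "frame accuracy:", round(clf.score(X_te, y_te), 3))
```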
12 October 2005
Stephen Cox
School of Computing Sciences, University of East Anglia, Norwich, UK
Automatic Musical Genre Classification
The combination of efficient high quality music coding schemes, cheap portable players that can store thousands (soon to be millions) of tracks and growing access to broadband from home has caused a sea change in the way that recordings of music are acquired and used. Next year, income from Internet subscription services to music will exceed sales of CDs and vinyl recordings for the first time, and it is very likely that this gap will widen in the future. This phenomenon raises a set of interesting questions about how users can identify and organise the music that they want to hear. Whilst much music available on the Internet has metadata associated with it that provides information on (for instance) the associated genre, artist(s), song title etc., much does not, and it would be very useful to be able to obtain such metadata automatically from the music signal itself. Furthermore, a measure of the musical similarity between two pieces would be very useful to aid users in e.g.
organising their collections and constructing playlists. In addition to these practical applications, computational processing of music is a fascinating area of artificial intelligence. In this talk, I will describe our work in one aspect of this field, automatic classification of musical genre. I will discuss both how this field (and related
fields) can build on existing work in speech processing and how they
need to develop new techniques and paradigms. Descriptions of
experiments in automatic musical genre classification will be given,
along with results, and new research opportunities and directions will
be discussed.
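A minimal sketch of the kind of pipeline the talk builds on, borrowing speech-processing features (MFCCs) and a standard classifier. File names and genre labels are placeholders, and this is not the feature set or classifier used in the work described.

```python
# Minimal genre-classification sketch: summarise each track by MFCC
# statistics and feed the summaries to a standard classifier.
import numpy as np
import librosa
from sklearn.svm import SVC

def track_features(path):
    """Summarise a track as the mean and std of its MFCCs."""
    y, sr = librosa.load(path, sr=22050, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Hypothetical training data: (path, genre) pairs supplied by the user.
train = [("rock_01.wav", "rock"), ("jazz_01.wav", "jazz")]   # placeholders
X = np.stack([track_features(p) for p, _ in train])
y = [g for _, g in train]

clf = SVC(kernel="rbf").fit(X, y)
print(clf.predict([track_features("unknown.wav")]))          # placeholder file
```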
Power point slides
Chief Scientist, Foveon Inc., USA
History and Future of Electronic Color Photography:
Where Vision and Silicon Meet
In the late twentieth century, developments in electronic color
photography employed color separation techniques recycled from
corresponding developments in silver halide photography of the late
nineteenth century. Multi-shot cameras, beam-splitter cameras, and
screen-plate or filter-mosaic cameras all had their day with film
about a century ago, and with electronic sensors more recently. The
multi-layer color sensing technique that dominated the twentieth
century, originally commercialized as Kodachrome, is now recapitulated
in the multi-layer silicon sensor introduced for the twenty-first
century as the Foveon X3 technology. These techniques for color
photography take their cues from human color vision, but ultimately
must listen to the silicon.
28 September 2005
Martin Cooke
Department of Computer Science, University of Sheffield, UK
Non-native speech perception in noise
Spoken communication in a non-native language is especially
difficult in the presence of noise. However, conflicting reports have
appeared concerning the degree to which non-natives suffer in noise
relative to natives. This study compared English and Spanish
listeners' perceptions of English intervocalic consonants as a
function of both non-native phonological competence and masker
type. Three backgrounds (stationary noise, multi-talker babble and
competing speech) provided varying amounts of energetic and
informational masking. Competing English and Spanish speech maskers
were used to examine the effect of masker language. Non-native
performance fell short of that of native listeners in quiet, but an even
larger performance differential was found in all masking
conditions. Both groups performed better in competing speech than in
stationary noise, and both suffered most in babble. Since babble is a
less effective energetic masker than stationary noise, these results
suggest that non-native listeners are more adversely affected by both
energetic and informational masking. The most competent Spanish
listeners in quiet were also the best in noise, but they also showed
the steepest drop in performance for the most difficult maskers. A
small effect of language background was evident: English listeners
performed better when the competing speech was Spanish.
26 July 2005
Nobuaki Minematsu
Department of Information and Communication Engineering, University of Tokyo, Japan
Theorem of the Invariant Structure and its Derivation of Speech Gestalt
Speech communication involves several steps: production, encoding,
transmission, decoding, and hearing. At every step, acoustic
distortions are inevitably introduced by differences in speakers,
microphones, rooms, hearing characteristics, and so on. These are
non-linguistic factors and are completely irrelevant to speech
recognition. Although the spectrogram always carries these factors,
almost all speech applications have been built on this "noisy"
representation. Recently, a novel representation of speech acoustics
has been proposed, called the acoustic universal structure. It
represents only the interrelations among speech events; the absolute
properties of the events are discarded completely. Interestingly, the
non-linguistic factors are thereby removed from speech, much as
cepstrum smoothing of the spectrogram removes pitch information.
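One way to picture the "structure" idea is as a matrix of pairwise distances between speech-event models, which is unchanged by any distortion that shifts all events alike. The sketch below uses diagonal Gaussians and a symmetric KL divergence as an assumed stand-in for the distance measure used in the actual work.

```python
# Hedged illustration of the "structure" idea: represent an utterance only
# by the pairwise distances between its speech events, so a distortion that
# shifts all events alike (a fixed bias here) leaves the structure intact.
# The distance used in the actual work may differ from this stand-in.
import numpy as np

def symmetric_kl(mu1, var1, mu2, var2):
    """Symmetrised KL divergence between two diagonal Gaussians."""
    kl12 = 0.5 * np.sum(np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1)
    kl21 = 0.5 * np.sum(np.log(var1 / var2) + (var2 + (mu1 - mu2) ** 2) / var1 - 1)
    return kl12 + kl21

def structure(events):
    """Matrix of pairwise distances between event models."""
    n = len(events)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            D[i, j] = symmetric_kl(*events[i], *events[j])
    return D

rng = np.random.default_rng(3)
events = [(rng.normal(size=12), np.ones(12)) for _ in range(5)]   # (mean, var)
shifted = [(mu + 2.0, var) for mu, var in events]   # speaker-like bias on all events

print(np.allclose(structure(events), structure(shifted)))   # True: structure unchanged
```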
In this talk, the theoretical background of the new speech
representation is described in detail from the viewpoints of
linguistics, psychology, acoustics and mathematics, together with some
results of recognition and perception experiments. It will be shown
that the new representation can be viewed as a speech Gestalt.
Finally, some strategic similarities in speech processing between
autistic people and current speech recognizers are discussed.
12 May 2005
Sarah Hawkins
Phonetics Laboratory, University of Cambridge, UK
Perceptual coherence and speech understanding:
what a speech perception model should look like
I will discuss acoustic-phonetic and perceptual data showing that
small differences in the speech signal reflect systematic phonological
and/or grammatical differences, and are perceptually salient. Instead
of just enhancing a particular phonological contrast, at least some of
this systematic phonetic variation seems to contribute to speech
intelligibility by making the overall signal more 'perceptually
coherent'. Perceptual coherence in speech encompasses properties of
the signal that are probably natural consequences of vocal-tract
dynamics (such as changes in excitation type as the articulators move
from a vowel to an obstruent consonant), as well as patterns that are
language- and accent-specific (such as, in English, widespread effects
on vowel formant frequencies due to the presence of an /r/ in the
utterance). Some effects last less than 50 ms, while others can last
for over 500 ms. Still others may last longer. Perceptual coherence
could facilitate perception by adding useful redundancy to particular
phonological contrasts, but another possibility is that its main value
is in grouping the signal so that it sounds as if it is coming from a
single source, and provides consistent information over long time
scales. I will discuss these possibilities, relating them to a
polysystemic approach to speech perception, Polysp. Compared with
standard phonetic and psycholinguistic models of speech perception,
Polysp de-emphasizes the role of phonology and contrasts in lexical
form, and emphasizes understanding speech within the general
communicative situation, linguistic and non-linguistic.
Non-essential background reading on the general approach and on Polysp.
All three are available (with printers' errors corrected!) from here.
- Hawkins, S. (2003). Contribution of fine phonetic detail to speech understanding. Proceedings of the 15th International Congress of Phonetic Sciences. 293-296.
- Hawkins, S. (2003). Roles and representations of systematic fine phonetic detail in speech understanding. Journal of Phonetics, 31, 373-405.
- Hawkins, S., & Smith, R. H. (2001). Polysp: A polysystemic, phonetically-rich approach to speech understanding. Italian Journal of Linguistics-Rivista di Linguistica, 13, 99-188.
5 May 2005
Roger Moore
20/20 Speech Ltd., UK
How Good Does Automatic Speech Recognition Have to Be? ... and when will it be that good?
Automatic Speech Recognition (ASR) is often hailed as the most
'natural' interface between humans and machines, and it has recently
been cited as a technology likely to have huge market success over the
next few years. However, are these views founded on a realistic
assessment of what the technology can actually do, or are they based
on wishful thinking prompted by the vision of intelligent
conversational machines that is often portrayed in science
fiction?
This talk will review the state-of-the-art in automatic speech
recognition, and will illustrate that current performance is simply
not good enough for many applications. A comparison will be made
between ASR performance and that of a human listener, as well as
alternative methods of human-machine interaction. It will then be
shown how the 'goodness' of an ASR system may be characterised by modelling
it as equivalent to a hearing-impaired human listener, and this will be used to
predict the future capabilities of ASR. However, attention is drawn
to ASR's reliance on ever-increasing amounts of training data, and
this is contrasted with the amount of speech exposure involved in
human spoken language acquisition. The talk concludes with a
discussion of the implications for the future of automatic speech
recognition.
PDF slides
16 March 2005
Sarah Simpson
Department of Computer Science, University of Sheffield, UK
Consonant identification in N-talker babble as a function of N
Consonant identification rates were measured for VCV tokens gated
with N-talker babble noise and babble-modulated noise for an extensive
range of values for N. In the natural babble condition,
intelligibility was a non-monotonic function of N, with a broad
performance minimum from N = 6 to N = 128. Identification rates in
babble-modulated noise fell gradually with N. The contributions of
factors such as energetic masking, linguistic confusion, attentional
load, peripheral adaptation and stationarity to the perception of
consonants in N-talker babble are discussed.
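For readers unfamiliar with the stimuli, the sketch below shows one common way to build N-talker babble and mix it with a target token at a chosen SNR; the actual stimulus construction in the study may have differed, and the waveforms here are placeholders.

```python
# Minimal sketch of one way to build N-talker babble and mix it with a
# target VCV token at a given SNR. Talker waveforms are placeholders.
import numpy as np

def rms(x):
    return np.sqrt(np.mean(x ** 2))

def make_babble(talkers, n_samples, rng):
    """Sum random equal-length excerpts, one from each talker."""
    starts = [rng.integers(0, len(t) - n_samples) for t in talkers]
    return sum(t[s:s + n_samples] for t, s in zip(talkers, starts))

def mix(target, masker, snr_db):
    """Scale the masker so the target/masker level ratio equals snr_db."""
    gain = rms(target) / (rms(masker) * 10 ** (snr_db / 20))
    return target + gain * masker

rng = np.random.default_rng(4)
fs = 16000
target = rng.normal(size=fs)                              # placeholder VCV token
talkers = [rng.normal(size=10 * fs) for _ in range(8)]    # placeholder talkers
babble = make_babble(talkers, len(target), rng)
stimulus = mix(target, babble, snr_db=-6.0)
print(stimulus.shape)
```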
9 March 2005
Ryuichiro Higashinaka
Department of Computer Science, University of Sheffield, UK
Incorporating Discourse Features into Confidence Scoring of
Intention Recognition Results in Spoken Dialogue Systems
This paper proposes a method for the confidence scoring of
intention recognition results in spoken dialogue systems. To accomplish
its tasks, a spoken dialogue system has to recognize user
intentions. However, because of speech recognition errors and
ambiguity in user utterances, it sometimes has difficulty recognizing
them correctly. Confidence scoring allows errors in intention
recognition results to be detected and has proved useful for dialogue
management. Conventional methods base confidence scoring on features
obtained from the speech recognition results of single
utterances. However, this may be insufficient, since the intention
recognition result is the product of discourse processing. We propose
incorporating discourse features for a more accurate confidence
scoring of intention recognition results. Experimental results show
that incorporating discourse features significantly improves the
confidence scoring.
This is an ICASSP presentation.
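The proposal can be illustrated with a toy confidence scorer that combines ASR-level and discourse-level features in a single classifier; the features, values and model below are hypothetical stand-ins, not the paper's actual feature set or scoring method.

```python
# Hedged sketch: a confidence scorer over intention-recognition results
# that combines ASR-level with discourse-level features. All feature
# names and values are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: [asr_confidence, n_recognised_words, turns_since_last_correction,
#            intention_repeated_flag]; label 1 = intention correctly recognised.
X = np.array([
    [0.91, 5, 4, 1],
    [0.42, 3, 0, 0],
    [0.77, 6, 2, 1],
    [0.35, 2, 1, 0],
    [0.88, 4, 3, 1],
    [0.50, 3, 0, 0],
])
y = np.array([1, 0, 1, 0, 1, 0])

scorer = LogisticRegression(max_iter=1000).fit(X, y)
new_result = np.array([[0.60, 4, 2, 1]])     # ASR + discourse features
print("confidence:", scorer.predict_proba(new_result)[0, 1])
```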
16 February 2005
Kalle Palomäki
Helsinki University of Technology, Finland
Spatial Processing in Human Auditory Cortex: The Effects of 3D,
ITD and ILD Stimulation Technique
Here, the perception of auditory spatial information as indexed by
behavioral measures is linked to brain dynamics as reflected by the
N1m response recorded with whole-head magnetoencephalography
(MEG). Broadband noise stimuli with realistic spatial cues
corresponding to eight direction angles in the horizontal plane were
constructed via custom-made, individualized binaural recordings (BAR)
and generic head-related transfer functions (HRTF). For comparison
purposes, stimuli with impoverished acoustical cues were created via
interaural time and level differences (ITDs & ILDs) and their
combinations. MEG recordings in ten subjects revealed that the
amplitude of the N1m exhibits directional tuning to sound location,
with the right-hemispheric N1m responses being particularly sensitive
to the amount of spatial cues in the stimuli. The BAR, HRTF and
combined ITD+ILD stimuli resulted both in a larger dynamic range and
in a more systematic distribution of the N1m amplitude across stimulus
angle than did the ITD or ILD stimuli alone. The right-hemispheric
sensitivity to spatial cues was further emphasized with the latency
and source location of the N1m systematically varying as a function of
the amount of spatial cues present in the stimuli. In behavioral
tests, we measured the ability of the subjects to localize BAR and
HRTF stimuli in terms of azimuthal error and front-back confusions. We
found that behavioral performance correlated positively with the
amplitude of the N1m. Thus, the activity taking place already in the
auditory cortex predicts behavioral sound detection of spatial
stimuli, and the amount of spatial cues embedded in the signal is
reflected in the activity of this brain area.
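As an aside on the "impoverished cue" conditions, the sketch below builds a noise burst lateralised by an interaural time difference (a sample delay in one channel) and an interaural level difference (an attenuation of that channel); parameter values are illustrative, not those used in the study.

```python
# Minimal sketch of ITD/ILD stimulus construction: broadband noise
# lateralised by delaying and attenuating one channel. Values are
# illustrative only.
import numpy as np

fs = 48000
rng = np.random.default_rng(5)
noise = rng.normal(size=fs // 2)              # 500 ms broadband noise

itd_s = 0.0005                                # 0.5 ms interaural time difference
ild_db = 10.0                                 # 10 dB interaural level difference
delay = int(round(itd_s * fs))

left = noise
right = np.concatenate([np.zeros(delay), noise[:-delay]])   # delayed ear
right = right * 10 ** (-ild_db / 20)                        # attenuated ear

stereo = np.stack([left, right], axis=1)      # samples x 2, ready to play
print(stereo.shape)
```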