31 October 2007
Ian Howard
Biological & Machine Learning Lab, Department of Engineering, University of Cambridge, UK
A Computational Model of Infant Speech Development
Almost all theories of child speech development assume that
an infant learns speech sounds by direct imitation, performing
an acoustic matching of adult output to his own speech. Some
theories also postulate an innate link between perception and
production. We present a computer model which has no
requirement for acoustic matching on the part of the infant and
which treats speech production and perception as separate
processes with no innate link. Instead we propose that the
infant initially explores his speech apparatus and reinforces his
own actions on the basis of sensory salience, developing vocal
motor schemes [1]. As the infant's production develops, he
will start to generate utterances which are sufficiently speech-like
to provoke a linguistic response from his mother. Such
interactions are particularly important, because she is better
qualified than he is to judge the quality of his speech. Her
response to his vocal output is beneficial in a number of ways.
Because she is an experienced speaker, her perceptual
system can effectively evaluate the infant's output within the
phonological system of the ambient language L1. Simply
generating a salient response will tend to encourage the
infant's production of a given utterance. More significantly,
during imitative exchanges in which the mother reformulates
the infant's speech, the infant can learn equivalence relations
between his motor activity and his mother's acoustic output using
simple associative mechanisms, and thus can solve
the correspondence problem. Notice that the infant does not
learn equivalence relations between his own acoustic output
and that of his mother based on acoustic similarity. Any
similarity-based matching need only be performed by
his mother.
[1] McCune, L. & Vihman, M.M. 1987. Vocal Motor Schemes. Papers and
Reports on Child Language Development 26, Stanford University
Department of Linguistics, 72-79.
Speech Communication paper (draft)
PPT slides
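To make the associative mechanism described in the abstract concrete, here is a minimal, hypothetical sketch (an illustration, not the model presented in the talk): a Hebbian association is strengthened between the infant's motor pattern and the mother's acoustic reformulation during an imitative exchange, so that adult speech later retrieves the infant's own motor scheme without any acoustic matching on the infant's part. All vector sizes and values are invented.

import numpy as np

class MotorAcousticAssociator:
    # Toy associative memory linking infant motor patterns to adult acoustics.

    def __init__(self, n_acoustic: int, n_motor: int):
        self.w = np.zeros((n_motor, n_acoustic))   # association weights

    def reinforce(self, motor: np.ndarray, maternal_acoustic: np.ndarray):
        # Hebbian update during an imitative exchange (mother reformulates).
        self.w += np.outer(motor, maternal_acoustic)

    def recall_motor(self, adult_acoustic: np.ndarray) -> np.ndarray:
        # Given adult speech, retrieve the associated infant motor pattern.
        return self.w @ adult_acoustic

rng = np.random.default_rng(1)
assoc = MotorAcousticAssociator(n_acoustic=12, n_motor=8)

# A babbling episode: a vocal motor scheme and the mother's reformulation of it.
motor_scheme = rng.random(8)
mother_output = rng.random(12)
assoc.reinforce(motor_scheme, mother_output)

# Later, a similar adult utterance evokes the infant's own motor scheme.
print(assoc.recall_motor(mother_output + 0.05 * rng.random(12)))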
17 October 2007
Peter van Hengel
Sound Intelligence, The Netherlands
Verbal Aggression Detection in Complex Acoustical Environments
A major problem in the developing market for CCTV systems is the monitoring of
all the camera images. Recent studies reveal that some 70% of incidents in view
of surveillance cameras are not noticed by the camera operators. This is not
surprising if one takes into account the number of cameras to be monitored
simultaneously. Comparison with a human observer it is clear that adding sound as
a trigger to camera surveillance should significantly improve the efficiency and
detection rate, while at the same time decreasing the impact on privacy. A verbal
aggression detection system has been developed by Sound Intelligence in cooperation with
the University of Groningen's Department of Artificial Intelligence. This system
has an attentional mechanism based on human neural processes which identifies
'interesting sounds'. These sounds are then analysed for cues of aggression. If
sufficient evidence for verbal aggression has been found, a signal is given to
the camera operator. A six-month study in Groningen's city centre shows a detection
rate of 100% with a false alarm rate of 0.15%. The system is currently being
introduced into the English market.
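Purely as an illustration of the two-stage structure described above, and emphatically not the Sound Intelligence system, the sketch below gates frames with an attention-like loudness criterion, scores the selected frames with a crude high-frequency-energy proxy for tense, shouted voice quality, and alerts the operator once enough frames agree. All features, thresholds and test signals are invented.

import numpy as np

def frame_signal(x: np.ndarray, frame_len: int = 400, hop: int = 200) -> np.ndarray:
    # Split a mono signal into overlapping frames.
    return np.array([x[i:i + frame_len] for i in range(0, len(x) - frame_len, hop)])

def detect_verbal_aggression(x: np.ndarray, sr: int = 16000) -> bool:
    frames = frame_signal(x)
    energy = (frames ** 2).mean(axis=1)

    # Attention stage: only frames well above the background level are
    # treated as "interesting" and passed on for further analysis.
    background = np.median(energy)
    interesting = energy > 10.0 * background

    # Cue stage: the proportion of energy above 1 kHz serves as a crude
    # stand-in for the raised, tense voice quality of verbal aggression.
    spectra = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    freqs = np.fft.rfftfreq(frames.shape[1], 1.0 / sr)
    hf_ratio = spectra[:, freqs > 1000].sum(axis=1) / spectra.sum(axis=1)

    # Alert the operator only if enough interesting frames carry the cue.
    return int((interesting & (hf_ratio > 0.5)).sum()) >= 5

rng = np.random.default_rng(0)
quiet = 0.01 * rng.standard_normal(48000)        # 3 s of quiet background
shout = 1.0 * rng.standard_normal(8000)          # 0.5 s loud broadband burst
print(detect_verbal_aggression(np.concatenate([quiet, shout])))   # True
print(detect_verbal_aggression(quiet))                            # False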
13 September 2007
Jort Gemmeke
Centre for Language and Speech Technology, Radboud University Nijmegen, The Netherlands
On the relation between statistical properties of spectrographic masks and their ability to reduce acoustic mismatch
The application of missing feature techniques (MFT) can significantly improve
the noise robustness of automatic speech recognition (ASR). When using MFT for
decoding, adequately estimated spectrographic masks are crucial. For the
development of better mask estimation techniques it would be extremely valuable
to have an analysis technique that can indicate which properties of a mask need
to be changed to serve its main purpose: reducing the acoustic mismatch between
a noisy feature vector and its corresponding acoustic model. As a first step
toward such an analysis technique we present a framework which enables us to
compare two mask creation techniques. We show that, using a suitable distance
function, we can relate the statistical properties of mask types to their
effectiveness in reducing the acoustic mismatch at frame level.
PPT slides
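As a rough, frame-level illustration of the idea, not the paper's framework, the sketch below computes a masked Euclidean distance per frame between a noisy and a clean log-spectrogram, restricted to the components a mask marks as reliable; the clean spectrogram stands in for the acoustic model, and a good mask should leave little residual mismatch. The spectrograms, masks and channel count are synthetic stand-ins.

import numpy as np

def masked_frame_mismatch(noisy: np.ndarray, clean: np.ndarray, mask: np.ndarray) -> np.ndarray:
    # Per-frame Euclidean mismatch over components marked reliable (mask == 1).
    # noisy, clean and mask all have shape (n_channels, n_frames).
    diff = (noisy - clean) * mask
    reliable = np.maximum(mask.sum(axis=0), 1)     # avoid division by zero
    return np.sqrt((diff ** 2).sum(axis=0) / reliable)

rng = np.random.default_rng(0)
clean = rng.normal(size=(23, 100))                 # e.g. 23 mel channels, 100 frames
noise = rng.normal(scale=2.0, size=clean.shape)
noisy = np.logaddexp(clean, noise)                 # crude log-domain mixing

oracle = (clean > noise).astype(float)             # oracle "speech dominates" mask
estimated = (rng.uniform(size=clean.shape) > 0.5).astype(float)   # a poor estimated mask

print("oracle mask   :", masked_frame_mismatch(noisy, clean, oracle).mean())
print("estimated mask:", masked_frame_mismatch(noisy, clean, estimated).mean())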
6 September 2007
Maria Wolters
CSTR, University of Edinburgh, UK
Adapting Dialogue Systems to Older People
In this talk, I present an overview of a Wizard-of-Oz experiment that
explored two guidelines for designing spoken dialogue systems for older
people:
- To minimise cognitive load, use few options
- Confirm selections to support memory
Cognitive abilities change as people age. For some people, this change
occurs more rapidly than for others. The areas affected most are speed
of processing, fluid intelligence, and working memory, while
crystallised intelligence is well-preserved. In order to make spoken
dialogue systems more usable for older people, one might assume that
they need to accommodate this cognitive ageing.
Our experiments show that for straightforward tasks such as appointment
scheduling that follow well-known schemata, accommodation may not be
necessary. We found no effects of dialogue strategy on user performance
or user satisfaction, although individual users expressed strong,
divergent preferences. There was, however, a considerable effect on
efficiency.
Future work includes exploring different domains, such as flight
booking, working with different user groups, such as older people with
mild memory impairment, and investigating differences in interaction
style between older and younger users on the corpus collected in this
experiment.
PDF slides
3 September 2007
Takatoshi Okuno
Audiological R&D lab, Rion Ltd, Tokyo, Japan
Development of Frequency Selectivity Map (FSMap) depiction system for hearing impairment
Background: Audiological assessment of hearing impairment and the
fitting of hearing aids require accurate measurement of an auditory
profile, which may comprise loudness recruitment, reduced frequency
selectivity, reduced temporal resolution and so on. In other words,
such complex characteristics of sensorineural hearing loss are
difficult to evaluate using the audiogram alone. A system that can
measure this auditory profile has long been awaited in clinical
practice and in hearing aid fitting.
Method: Frequency selectivity is known to be related to the
bandwidth of the auditory filters. To investigate reduced frequency
selectivity, a system has been developed that can measure an
individual auditory filter of a hearing-impaired listener within 3
minutes. The measured auditory filters are used to draw a Frequency
Selectivity Map (FSMap) with colour gradation; specifically, auditory
filter data at several frequencies and sensation levels are combined
to draw a single FSMap. As a candidate auditory profile, the system
calculates the ratio between the Equivalent Rectangular Bandwidth
(ERB) of an individual hearing-impaired listener and the average ERB
of normal-hearing listeners. Auditory filters of 31 hearing-impaired
listeners were measured using the proposed system.
Results: The frequency selectivity of listeners with sensorineural
hearing loss was reduced compared with that of normal-hearing
listeners. The FSMaps also revealed different degrees of frequency
selectivity across individuals with similar audiogram contours.
Conclusion: Reduced frequency selectivity, i.e. how much poorer a
listener's frequency selectivity is than normal, is indicated
quantitatively and intuitively by the FSMap. The results of this
study suggest that the FSMap has the potential to become a new
practical auditory profile.
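As a concrete reading of the ERB-ratio calculation described in the Method section, here is a minimal sketch (an illustration under stated assumptions, not the Rion system): the normal-hearing average ERB is approximated by the Glasberg and Moore (1990) formula, the measured values are invented, and a full FSMap would additionally vary the ratio with sensation level.

def erb_normal(f_hz: float) -> float:
    # Average ERB (Hz) of normal-hearing listeners (Glasberg & Moore, 1990).
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

def fsmap_ratios(measured_erbs: dict) -> dict:
    # Ratio of an impaired listener's ERB to the normal-hearing average;
    # 1.0 means normal selectivity, larger values mean broader filters.
    return {f: erb / erb_normal(f) for f, erb in measured_erbs.items()}

# Hypothetical auditory-filter measurements: centre frequency (Hz) -> ERB (Hz).
measured = {500.0: 120.0, 1000.0: 260.0, 2000.0: 640.0, 4000.0: 1500.0}
for f, ratio in fsmap_ratios(measured).items():
    print(f"{f:6.0f} Hz: ERB ratio = {ratio:.2f}")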
22 August 2007
Jonathan Laidler
Department of Computer Science, University of Sheffield, UK
Model-driven detection of clean speech patches in noise
Listeners may be able to recognise speech in adverse conditions by
"glimpsing" time-frequency regions where the target speech is dominant.
Previous computational attempts to identify such regions have been
source-driven, using primitive cues. This talk describes a model-driven
approach in which the likelihood of spectro-temporal patches of a noisy
mixture representing speech is given by a generative model. The focus is
on patch size and patch modelling. Small patches lead to a lack of
discrimination, while large patches are more likely to contain
contributions from other sources. A "cleanness" measure reveals that a
good patch size is one which extends over a quarter of the speech
frequency range and lasts for 40 ms. Gaussian mixture models are used to
represent patches. A compact representation based on a 2D discrete
cosine transform leads to reasonable speech/background discrimination.
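The patch-scoring idea can be sketched as follows (an illustrative assumption, not the talk's implementation): spectro-temporal patches are reduced to low-order 2D DCT coefficients and scored under a Gaussian mixture model trained on clean-speech patches, with higher log-likelihood suggesting a speech-dominated patch. The patch size, number of retained coefficients and GMM configuration are placeholders, and the spectrograms are random stand-ins.

import numpy as np
from scipy.fft import dctn
from sklearn.mixture import GaussianMixture

def patch_features(patch: np.ndarray, keep: int = 5) -> np.ndarray:
    # Compact representation: low-order 2D DCT coefficients of one patch.
    coeffs = dctn(patch, norm="ortho")
    return coeffs[:keep, :keep].ravel()

def extract_patches(spectrogram: np.ndarray, freq_bins: int, frames: int):
    # Tile the spectrogram into non-overlapping frequency x time patches.
    n_f, n_t = spectrogram.shape
    for f0 in range(0, n_f - freq_bins + 1, freq_bins):
        for t0 in range(0, n_t - frames + 1, frames):
            yield spectrogram[f0:f0 + freq_bins, t0:t0 + frames]

rng = np.random.default_rng(0)

# Train a GMM on patches drawn from clean speech (here: random stand-in data).
clean = rng.standard_normal((64, 1000))        # placeholder clean log-spectrogram
train = np.array([patch_features(p) for p in extract_patches(clean, 16, 4)])
gmm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0).fit(train)

# Score patches of a noisy mixture: higher log-likelihood suggests a patch
# dominated by speech rather than background.
noisy = rng.standard_normal((64, 200))         # placeholder noisy log-spectrogram
test = np.array([patch_features(p) for p in extract_patches(noisy, 16, 4)])
print(gmm.score_samples(test))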
25 July 2007
Sarah Creer
Department of Computer Science, University of Sheffield, UK
Modern Speech Synthesis
The aim of this work is to demonstrate what speech synthesis
techniques are available for building synthetic voices for potential use
in communication aids. Synthetic speech is generally viewed as sounding
robotic and unnatural, frequently showing some mismatch between the
voice and the communication aid user. Current technology provides
techniques for building more personalised output for the
communication aid user, using a minimal amount of data to produce
natural and intelligible-sounding speech.
Current VOCAs only allow personalisation of gender, language and, to a
limited extent, age. The voice is an identifier of the person to whom it
belongs, providing clues about the gender, age, size, ethnicity and
geographical identity of the person along with identifying them as that
particular individual to family members, friends and, once interaction
has begun, to new communication partners. Maintaining the individual's
identity will ease interaction and help maintain social
relationships. Attempts to provide a more closely matching voice for the
user can be demonstrated with the VIVOCA device, currently under
development at the University of Sheffield and Barnsley Hospital. The
VIVOCA takes the speech of a speaker with dysarthria, recognises it and
sends a transcription to a text-to-speech synthesiser which will then
output what the client is saying.
The VIVOCA users will be dysarthric speakers in the Barnsley and
Sheffield area and so to preserve the geographic identity of the
clients, attempts have been made to provide speech synthesisers with a
local accent. To build voices for speech synthesis, different techniques
have been used. These techniques require data from a speaker with a
local accent. To achieve the best quality of recordings and a consistent
and natural set of data, a professional speaker was sought. Ian
McMillan, a Barnsley poet and broadcaster, provided the first of a proposed
library of locally accented recordings of a phonetically balanced
dataset from which to build a synthetic voice. Using this data,
synthetic voices have been built using different techniques, such as
concatenative synthesis and hidden Markov model-based synthesis.
Research will also investigate how to extend this voice building
to further personalise dysarthric speakers' synthetic voices,
particularly for those clients with progressive speech disorders who
will eventually rely on communication aids as their primary mode of
communication. Attempts will be made using data from dysarthric speakers
who still have some voice function to adapt the synthetic voice so that
it will sound more like that individual.
7 June 2007
Piers Messum
Department of Phonetics and Linguistics, University College London, UK
No role for imitation in learning to pronounce
In due course, a child learns to pronounce words by imitation, copying the
sequence of speech sounds that he or she hears used by adult models. However,
there is a conceptual precursor to this that we can call plain "learning to
pronounce", where a child learns to produce speech sounds that listeners take to
be equivalent to those that they themselves produce. It is almost universally
assumed that this process is also imitative, but there has been no critical
examination of this belief. In fact, there is no evidence to support it and some
good reasons to doubt it.
There is an alternative, by which a child might learn to pronounce through
vocal mirroring on the part of his or her caregivers. This is plausible in the
light of other mirroring interactions, explains some communicative behaviors of
mothers and infants, and solves some of the fundamental problems in speech.
Paper of the talk
PhD thesis
23 May 2007
Robin Hofe
Department of Computer Science, University of Sheffield, UK
Tongues, Trunks and Tentacles: Energetics in Physiology and Speech
One of the amazing things about spoken language is the huge
amount of variation that occurs within the speech signal.
Humans are perfectly able to cope with this variation, and can
even use it to convey additional information or to increase
intelligibility for the listener. On the other hand, the same
variation causes great problems in technological applications.
One possible way to explain some of the variation is the H&H
theory by Björn Lindblom. To investigate the theoretical basis
of that theory, a biomimetic vocal tract model is currently
being developed. It will be used to clarify the energetics of
human articulatory movements and provide an experimental speech
simulation tool.
PPT slides
Audio and video associated with the slides
2 May 2007
Yanchen Lu
Department of Computer Science, University of Sheffield, UK
Mechanisms for Human Speech Acquisition and Processing
Auditory distance perception is an important component of spatial
hearing. Humans often give a biased estimate of perceived distance
with respect to physical source distance. Typically, subjects tend to
underestimate the distance of far-away sound sources and overestimate
that of sources closer than 1 metre. Intensity, reverberation,
interaural difference and spectrum are believed to be the major cues
in generating the distance judgement. Studies suggest those static
cues cannot lead to an absolute judgement without prior information
about the sound source or environment. However, a moving listener can
exploit the temporal context of dynamic auditory cues, motion parallax
and acoustic tau, to infer absolute distance.
In this presentation, I will introduce a sequential model that obtains
distance estimates from these dynamic cues within a particle filtering
framework. This sequential model highlights the importance of temporal
reasoning in this perception task by demonstrating its superiority over
an instantaneous model.
PDF slides
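The following is a minimal bootstrap particle filter sketch of this kind of sequential distance estimation (an illustrative construction, not the speaker's model): a listener walks towards a source of unknown level, a single received-level observation is ambiguous between distance and source level, but the way the level evolves as the listener moves lets the filter recover absolute distance. All parameters and the observation model are invented.

import numpy as np

rng = np.random.default_rng(0)
n, dt, speed = 1000, 0.1, 1.0          # particles, time step (s), listener speed (m/s)
obs_sigma = 1.0                        # observation noise (dB)

def received_db(src_level, distance):
    # Received level = source level at 1 m minus spherical spreading loss.
    return src_level - 20.0 * np.log10(distance)

# Ground truth: a source of unknown level, initially 5 m ahead of the listener.
true_d, true_level = 5.0, 60.0

# Particles over the joint state (distance, source level): one observation
# cannot separate the two, but their evolution over time can.
d = rng.uniform(0.5, 10.0, n)
lvl = rng.uniform(40.0, 80.0, n)

for step in range(40):
    true_d = max(true_d - speed * dt, 0.3)
    obs = received_db(true_level, true_d) + rng.normal(0.0, obs_sigma)

    # Predict: distances shrink as the listener advances; source level is constant.
    d = np.maximum(d - speed * dt + rng.normal(0.0, 0.05, n), 0.1)
    lvl = lvl + rng.normal(0.0, 0.1, n)

    # Weight each particle by the likelihood of the observed level, then resample.
    w = np.exp(-0.5 * ((obs - received_db(lvl, d)) / obs_sigma) ** 2)
    idx = rng.choice(n, n, p=w / w.sum())
    d, lvl = d[idx], lvl[idx]

    if step % 10 == 9:
        print(f"true {true_d:.2f} m, estimated {d.mean():.2f} m")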
25 April 2007
Russell Mason
Institute of Sound Recording, University of Surrey, Guildford, UK
The role of head movement in perception and measurement of spatial impression
In most listening situations, humans actively investigate the sound
field by moving their heads, especially if the sound field is
ambiguous or confusing. However, measurements tend to use a single
static receiver position, which may not adequately reflect the
listener's perception. A project is currently being undertaken to
investigate the type of head movements made by listeners in different
situations, how best to capture binaural signals to represent these
movements, and how to interpret the results of objective analyses
incorporating a range of head positions. The results of the first
experiment will be discussed in detail, showing that the head
movements made by listeners are strongly dependent on the listening
task.
28 March 2007
Mike Carey
University of Birmingham, UK
Mechanisms for Human Speech Acquisition and Processing
The methods that humans use for processing speech and language are an exciting area of
scientific enquiry. Since humans are, almost without exception, far superior to
machines at this task, human speech processing is of considerable interest to engineers
designing speech processing systems. A key feature of the human system is that it is
learnt: we are not born talking. Hence any system proposed as an analogue of human speech
processing must take this into account. However, there are strong differences of opinion
between those like Chomsky who believe some evolutionary pre-adaptation of the human brain
is necessary for speech processing and those like Piaget who believe it is solely a
consequence of human brain processing power. It's also important to keep in mind that
language is acquired through speech and not text.
The approach described in this talk is to address this problem "bottom-up", starting
with the newborn infant's problem of discriminating between speech and noise. We then
address a possible method for acquiring the ability to discriminate between significant
speech features. Finally, we describe the accommodation of timescale variability using a
multi-layered neural network model required to recognize phonemes and words.
PPT Slides
7 March 2007
Thomas Poulsen
Department of Computer Science, University of Sheffield, UK
Sound Localization Through Evolutionary Learning Applied to Spiking Neural Networks
There is much ongoing work aimed at understanding the neural functionality
involved in hearing. Typically this research attempts to "reverse-engineer" the neural
processes in different animals, which has provided many invaluable insights, but it is
nevertheless an extremely difficult task given the complexities involved. Simulations,
however, offer a different approach to this problem, in that they provide a look into
neural encoding in much simpler structures. In my talk, I will present a simulative
approach to demonstrate that temporally encoded neural networks can be evolved for the
task of sound localization.
21 February 2007
Sue Harding
Department of Computer Science, University of Sheffield, UK
Auditory Gist Perception and Attention
In auditory scene analysis research, it is typically assumed that the
sensory signal is first broken down into features which are grouped into
auditory streams according to their low-level properties and then attention is
applied to select one of the streams. On the other hand, research into visual
attention has suggested that the gist of a visual scene is perceived before
attention is focused on the details of a particular object. In this talk, I
will describe some evidence that the gist of an auditory scene is also
perceived before the detail, and I will illustrate this with comparisons
between vision and hearing. Some ideas for gist representations will also be
discussed.
2 February 2007
Kalle Palomäki
Adaptive Informatics Research Centre, Helsinki University of Technology, Finland
Speech recognition activities in the Adaptive Informatics Research Centre
This talk gives an overview of automatic speech recognition activities in the Adaptive
Informatics Research Centre at the Helsinki University of Technology. The group's research activities
are centred mostly on unlimited-vocabulary, morpheme-based recognition of Finnish. In the
past the group's focus has mostly been on developing better language models for unlimited-vocabulary
recognition, while the acoustic modelling side has been taken from a rather standard large-vocabulary
ASR system. Recently, however, we have launched more activities in acoustic modelling and
are setting up a team that concentrates on noise-robust ASR and the use of auditory models within it. Two
specific starting points will be, first, to develop new missing data algorithms suitable for
large-vocabulary ASR and, second, to conduct human vs. machine speech recognition tests to better exploit
knowledge about human speech recognition in ASR. In summary, the purpose of this talk is to give an
overview of our group's activities and to initiate discussions about potential collaboration topics.
31 January 2007
Roger Moore
Department of Computer Science, University of Sheffield, UK
Sensorimotor Overlap in Living Organisms
On 11th January 2007, the University of Birmingham hosted a one-day open meeting on 'Unified
Models for Speech Recognition and Synthesis', and this talk is a repeat of the opening invited
presentation. The talk was founded on the premise that it is not only interesting to speculate on
the technological advantages of combining information between the input and output modalities of
practical speech technology systems, but it is also informative to consider the evidence for such
sensorimotor overlap in living organisms. In fact, recent research in the neurosciences offers strong
evidence for quite intimate connections between sensor and motor behaviour, and this has given rise
to some very interesting insights into the benefits that such organisational structures can provide.
This talk summarises these findings and seeks to illuminate the wider implications of attempting to
unify representations of interpretive and generative behaviour in living or automatic systems.