Dimitrios Pappas

International Faculty, CITY College Thessaloniki

Harnessing Emergence for Distributed CASA

Abstract: Computational Auditory Scene Analysis (CASA) is the study of the human auditory system by computational means, specifically of how it organises sound into perceptually meaningful elements. CASA (machine listening) systems essentially aspire to separate mixtures of sound sources into individual sources, based on findings from assorted disciplines such as neuroscience and biology. These systems have important applications, such as hearing prostheses, robust automatic speech and speaker recognition, and auditory scene reconstruction. With continuing technological advances in robotics, mobile sensors and smart devices, a realm of possibilities for new CASA applications is open for exploration. This study aims to use bio-inspired emergence as a navigation tool; emergence here means designing adaptive, optimised complex systems at the macroscopic level by implementing microscopic properties in their components. The pilot study currently investigates how far the energy management and movement strategies of mobile devices tracking sound sources in noisy environments can be optimised, taking inspiration from treefrog biology and community behaviour. Future efforts will introduce more low-level properties and component interaction models, some inspired by other animals, that can produce desirable high-level properties to harness for assorted CASA applications.

Erfan Loweimi

SPandH - Pre PhD Viva

Robust Phase-based Speech Signal Processing; From Source-Filter Separation to Model-Based Robust ASR

Fourier analysis plays a key role in speech signal processing. As a complex quantity, the Fourier transform can be expressed in polar form using the magnitude and phase spectra. The magnitude spectrum is widely used in almost every corner of speech processing. However, the phase spectrum is not an obviously appealing starting point for processing the speech signal. In contrast to the magnitude spectrum, whose fine and coarse structures have a clear relation to speech perception, the phase spectrum is difficult to interpret and manipulate. In fact, it shows no meaningful trend or extrema that might facilitate the modelling process. Nonetheless, the speech phase spectrum has recently gained renewed attention. An expanding body of work is showing that it can be usefully employed in a multitude of speech processing applications. Now that the potential of phase-based speech processing has been established, there is a need for a fundamental model to help understand the way in which phase encodes speech information. In this thesis a novel phase-domain source-filter model is proposed that allows for deconvolution of the speech vocal tract (filter) and excitation (source) components through phase processing. This model utilises the Hilbert transform, shows how the excitation and vocal tract elements mix in the phase domain and provides a framework for efficiently segregating the source and filter components through phase manipulation. To investigate the efficacy of the suggested approach, a set of features is extracted from the filter part of the phase for automatic speech recognition (ASR), and the source part of the phase is utilised for fundamental frequency estimation. Accuracy and robustness in both cases are illustrated and discussed. In addition, the proposed approach is improved by replacing the log with the generalised logarithmic function in the Hilbert transform and also by computing the group delay via a regression filter.
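As a minimal sketch of the quantities mentioned above, the following Python code splits a signal frame into magnitude and phase spectra and computes the group delay (the negative frequency derivative of the phase). It is not the thesis's method: the group delay here uses the standard Fourier-transform identity rather than the regression filter named in the abstract, and the function names, `n_fft` and `eps` are assumptions made for illustration.

```python
import numpy as np

def polar_spectrum(frame, n_fft=512):
    """Return the magnitude and (wrapped) phase spectra of one frame."""
    spectrum = np.fft.rfft(frame, n=n_fft)
    return np.abs(spectrum), np.angle(spectrum)

def group_delay(frame, n_fft=512, eps=1e-12):
    """Group delay tau(w) = -d(phase)/dw, computed without phase unwrapping.

    Uses the identity tau = Re(Y / X), where X is the transform of x[n]
    and Y is the transform of n * x[n]. This is an illustrative stand-in,
    not the regression-filter approach described in the abstract.
    """
    frame = np.asarray(frame, dtype=float)
    n = np.arange(len(frame))
    X = np.fft.rfft(frame, n=n_fft)
    Y = np.fft.rfft(n * frame, n=n_fft)
    # eps guards against division by zero at spectral nulls (assumed value).
    return (X.real * Y.real + X.imag * Y.imag) / (np.abs(X) ** 2 + eps)
```

For example, a unit impulse delayed by five samples has a flat magnitude spectrum and a constant group delay of five samples at every frequency, which makes it a convenient sanity check.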
Furthermore, the statistical distribution of the phase spectrum and its representations along the feature extraction pipeline are studied. It is illustrated that the phase spectrum has a bell-shaped distribution. Some statistical normalisation methods, such as mean-variance normalisation, Laplacianisation, Gaussianisation and histogram equalisation, are successfully applied to the phase-based features and lead to a significant robustness improvement. The robustness gain achieved through statistical normalisation and the generalised logarithmic function encouraged the use of more advanced model-based statistical techniques such as vector Taylor series (VTS). VTS in its original formulation assumes that the log function is used for compression. In order to take advantage of VTS and the generalised logarithmic function simultaneously, a new formulation is first developed to merge both into a unified framework called generalised VTS (gVTS). Also, in order to leverage the gVTS framework, a novel channel noise estimation method is developed. The extensions of the gVTS framework and the proposed channel estimation to the group delay domain are then explored. The problems this extension presents are analysed and discussed, some solutions are proposed and finally the corresponding formulae are derived. Moreover, the effects of additive noise and channel distortion in the phase and group delay domains are scrutinised, and the results are utilised in deriving the gVTS equations. Experimental results on the Aurora-4 ASR task, in an HMM/GMM setup along with a DNN-based bottleneck system in the clean and multi-style training modes, confirmed the efficacy of the proposed approach in dealing with both additive and channel noise.
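Two of the operations named above can be sketched in a few lines of Python. The abstract does not define the generalised logarithmic function, so the power-law (Box-Cox style) form below is an assumption, as are the function names and the `alpha` and `eps` parameters; the mean-variance normalisation is the usual per-dimension standardisation over the frames of an utterance.

```python
import numpy as np

def gen_log(x, alpha):
    """Assumed generalised logarithm: (x**alpha - 1) / alpha for alpha != 0.

    It tends to the natural log as alpha approaches 0, so alpha == 0 is
    treated as the ordinary log. Inputs must be positive.
    """
    x = np.asarray(x, dtype=float)
    if alpha == 0:
        return np.log(x)
    return (x ** alpha - 1.0) / alpha

def mean_variance_normalise(features, eps=1e-8):
    """Standardise each feature dimension to zero mean and unit variance.

    `features` is a (frames, dimensions) array; statistics are taken over
    the frame axis. eps (assumed value) avoids division by zero.
    """
    features = np.asarray(features, dtype=float)
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / (std + eps)
```

With this definition, a small `alpha` behaves almost like the log, while larger values compress less aggressively, which is one way to read the abstract's motivation for replacing the log.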

Department of Computer Science

SpandH Seminar Abstracts
