Episodic Memory for Automatic Speech Recognition

Investigator: Viktoria Maier Supervisor: Roger K. Moore

It has become apparent in recent years not only that the performance of automatic speech recognition (ASR) is an order of magnitude worse than that of human speech recognition (HSR) [1], but also that the incremental improvements in state-of-the-art systems are asymptoting to a level of performance well short of that required for many practical applications [2].

In current state-of-the-art systems, probability density functions (pdfs) provide a powerful method of generalising from seen to unseen data. However, the use of pdfs entails a potential loss of information: the detail present in individual data samples is sacrificed in order to pool information in a controlled fashion.
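The trade-off described above can be illustrated with a toy sketch. It is not the method used in this work: the data are made-up one-dimensional numbers rather than speech features, a single Gaussian per class stands in for the far richer pdfs of real ASR systems, and the names TOKENS, classify_pooled and classify_exemplar are hypothetical. Class priors are assumed equal.

```python
from statistics import NormalDist

# Toy 1-D "tokens" per class (invented numbers, not speech data).
TOKENS = {
    "a": [-2.1, -2.0, -1.9, 1.9, 2.0, 2.1],  # two well-separated clusters
    "b": [-0.2, -0.1, 0.0, 0.1, 0.2],        # one tight cluster
}

def classify_pooled(x, tokens=TOKENS):
    """Pool each class into one Gaussian pdf and pick the highest density."""
    models = {label: NormalDist.from_samples(xs) for label, xs in tokens.items()}
    return max(models, key=lambda label: models[label].pdf(x))

def classify_exemplar(x, tokens=TOKENS):
    """Keep every individual token and return the label of the closest one."""
    pairs = [(abs(x - t), label) for label, xs in tokens.items() for t in xs]
    return min(pairs)[1]

if __name__ == "__main__":
    for x in (0.5, 2.0):
        print(x, classify_pooled(x), classify_exemplar(x))
```

For x = 0.5 the pooled model answers "a", because the single Gaussian fitted to class "a" spreads density over the gap between its two clusters, where no "a" token was ever observed. The exemplar approach answers "b", since the nearest stored token belongs to "b". Both agree on "a" for x = 2.0.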

This work investigates solutions that retain the 'fidelity' contained in individual tokens, drawing on insights from the HSR literature on 'episodic memory' [3][4][5]. We refer to the approach as 'episodic modelling'. The long-term aim is to combine the benefits of ASR and HSR to achieve performance beyond that of conventional ASR.

[1] Lippmann, R., "Speech Recognition by Machines and Humans", Speech Communication, 22, 1-15, Elsevier, 1997.
[2] Moore, R. K., "A Comparison of the Data Requirements of Automatic Speech Recognition Systems and Human Listeners", Proc. Eurospeech, 2582-2584, 2003.
[3] Luce, P. A. and Lyons, E. A., "Specificity of Memory Representations for Spoken Words", Memory and Cognition, 26(4): 708-715, 1998.
[4] Goldinger, S. D., "Words and Voices: Episodic Traces in Spoken Word Identification and Recognition Memory", Journal of Experimental Psychology: Learning Memory and Cognition, 22(5): 1166-1183, 1996.
[5] Kraljevic, T. and Samuel, A. G., "How General is Perceptual Learning for Speech?", unpublished.