### Multistream techniques for audio-visual recognition: an analysis

#### Frédéric Berthommier

#### ICP, Grenoble, France

Audio-visual recognition is a mature part of the multistream recognition field, in which many experiments have been carried out. Using simulations based on ANN/HMM recognition and the STRUT software package, which gives access to the posterior values, we analysed the similarity between standard models of audio-visual fusion and RESPITE multistream techniques. The classical model of audio-visual fusion for AVSR in noise weights the audio and video posteriors as follows:

$P(H_i \mid A,V) = P(H_i \mid A)^{a} \cdot P(H_i \mid V)^{1-a}$ **(Eq. 1)**

where *a* depends on the reliability of the audio data, the video data being clean. On the other hand, perceptual studies by Massaro et al. have shown that Bayes' rule is effective for modelling the natural audio-visual complementarity:
$P(H_i \mid A,V) = P(H_i \mid A) \cdot P(H_i \mid V) \cdot \mathrm{eps} / P(H_i)$ **(Eq. 2)**

where *eps* is a renormalisation coefficient. If we integrate the priors in (Eq. 1), we obtain what we call geometric weighting, proposed by Heckmann et al. (JASP, 2002):
$P(H_i \mid A,V) = P(H_i \mid A)^{a} \cdot P(H_i \mid V)^{b} \cdot \mathrm{eps}' / P(H_i)^{a+b-1}$ **(Eq. 3)**

where *a* and *b* vary in opposite directions according to a single parameter *c*. Compared with the two other models, (Eq. 3) gives the best results in noise, mainly at low SNR, and it meets the audio-visual challenge A+V > A or V, that is, the combined system outperforms either modality alone. By contrast, (Eq. 1) gives WERs close to min(A,V), and Bayes' rule fails when SNR < 0 dB. In fact, (Eq. 3) results from merging (Eq. 1) and (Eq. 2), from which it acquires the best properties. The bridge between the FCA proposed by Morris et al. and geometric weighting appears when we develop the FCA as follows:
$P(H_i \mid A,V) = a \cdot P(H_i \mid A) \cdot P(H_i \mid V) \cdot \mathrm{eps}'' / P(H_i) + b \cdot P(H_i \mid A) + c \cdot P(H_i \mid V) + d \cdot P(H_i)$ **(Eq. 4)**

where the contributions of the four combinations are weighted linearly, with coefficients set according to the same parameter *c*. This equation also includes Bayes' rule and shows the same tendencies as (Eq. 3). The AV posteriors derived from these two equations are close to each other, and the recognition results are similar. The cause of the failure of Bayes' rule used in isolation appears when we build the phone-level confusion matrices of audio in noise (< 0 dB) and of video alone. The main error is the false identification of the silence state, whose rate is lower for video, about 22%, than for audio, about 75% at -6 dB. Consequently, these cross-modally coherent errors cannot be filtered out by Bayes' rule, and it is better to down-weight the stream with the higher confusion for the silence state. This is what happens with (Eq. 3) and (Eq. 4). The automatic setting of the parameter *c* was tested with three reliability measures (entropy, probabilistic voicing index and dispersion), which performed similarly. Surprisingly, there is no difference between adaptive weighting (frame by frame) and constant weighting (where the weights have a constant value for all utterances in a given condition). We therefore conclude that, in continuous speech recognition, the role of the weighting is neither to implement nor to enhance audio-visual complementarity at the phonetic level, which is already achieved by the inclusion of Bayes' rule, but, in loud noise, to switch globally towards the modality with less confusion with the silence state, because this kind of error cannot be compensated.
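To make the relationship between the four fusion rules concrete, the following is a minimal Python sketch for a single frame. It is an illustration under stated assumptions, not the STRUT-based system used in the experiments. The function names and toy posteriors are hypothetical. The denominator of (Eq. 3) is taken to be the prior P(Hi), following the statement that (Eq. 3) integrates the priors into (Eq. 1). The renormalisation coefficients are implemented as a division by the sum over classes, which is also applied to (Eq. 1) so that the rules are comparable. The mapping from the parameter *c* to the weights is not given in the text, so the weights are passed in directly, and the weights of (Eq. 4) are assumed to sum to 1.

```python
def normalise(scores):
    """Rescale non-negative scores so that they sum to 1 (plays the role of eps)."""
    total = sum(scores)
    return [s / total for s in scores]


def weighted_product(p_a, p_v, a):
    """Eq. 1: P(Hi|A)^a * P(Hi|V)^(1-a), renormalised here (an assumption)."""
    return normalise([pa ** a * pv ** (1.0 - a) for pa, pv in zip(p_a, p_v)])


def bayes_rule(p_a, p_v, prior):
    """Eq. 2: P(Hi|A) * P(Hi|V) / P(Hi), renormalised. Priors must be > 0."""
    return normalise([pa * pv / p for pa, pv, p in zip(p_a, p_v, prior)])


def geometric_weighting(p_a, p_v, prior, a, b):
    """Eq. 3, with the prior P(Hi) assumed in the denominator."""
    return normalise([
        pa ** a * pv ** b / p ** (a + b - 1.0)
        for pa, pv, p in zip(p_a, p_v, prior)
    ])


def linear_combination(p_a, p_v, prior, w_av, w_a, w_v, w_0):
    """Eq. 4: linear mix of the Bayes term, each stream alone, and the prior.

    w_av, w_a, w_v, w_0 stand for the coefficients a, b, c, d of Eq. 4 and
    are assumed to sum to 1 so that the result is a distribution.
    """
    av = bayes_rule(p_a, p_v, prior)
    return [
        w_av * x + w_a * pa + w_v * pv + w_0 * p
        for x, pa, pv, p in zip(av, p_a, p_v, prior)
    ]


if __name__ == "__main__":
    # Toy three-class posteriors for one frame (hypothetical values).
    p_audio = [0.5, 0.3, 0.2]
    p_video = [0.2, 0.6, 0.2]
    prior = [0.4, 0.4, 0.2]

    print("Eq. 1:", weighted_product(p_audio, p_video, a=0.7))
    print("Eq. 2:", bayes_rule(p_audio, p_video, prior))
    print("Eq. 3:", geometric_weighting(p_audio, p_video, prior, a=0.7, b=0.8))
    print("Eq. 4:", linear_combination(p_audio, p_video, prior, 0.6, 0.1, 0.2, 0.1))
```

Under these assumptions, `geometric_weighting` reduces to `bayes_rule` when a = b = 1 and to `weighted_product` when a + b = 1, which is one way to read the statement that (Eq. 3) merges (Eq. 1) and (Eq. 2). Likewise, `linear_combination` with all the weight on the first term is Bayes' rule.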

Jon Barker
Last modified: Mon Jan 29 15:59:06 GMT 2001