M4 - MultiModal Meeting Manager

Overview

The M4 project started in March 2002 and has a duration of three years.

The overall objective of the project is the construction of a demonstration system to enable structuring, browsing and querying of an archive of automatically analysed meetings. The archived meetings will have taken place in a room equipped with multimodal sensors.

For each meeting, audio, video, textual, and (possibly) interaction information will be available. Audio information will come from close-talking and distant microphones, as well as binaural recordings. Video information will come from multiple cameras. While the video and audio information will form several streams of data generated during the meeting, the textual information (the agenda, discussion papers, the text of slides) will be pre-generated and can be used to guide the automatic structuring of the meeting. The interaction stream consists of any information that can help in analysing events within the meeting, for example, mouse tracking from a PC-based presentation or laser pointing information.
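As an illustration only, the following Python sketch shows one way the per-meeting streams and side information might be organised. The class names, fields, and file paths are assumptions made for this example, not the project's actual data schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AudioStream:
    """One audio channel captured during the meeting."""
    source: str             # e.g. "close-talking", "distant", "binaural"
    file_path: str          # path to the recorded waveform
    sample_rate_hz: int = 16000


@dataclass
class VideoStream:
    """One camera view of the meeting room."""
    camera_id: str
    file_path: str
    frame_rate: float = 25.0


@dataclass
class MeetingRecording:
    """Recorded streams plus pre-generated side information for one meeting."""
    meeting_id: str
    audio: List[AudioStream] = field(default_factory=list)
    video: List[VideoStream] = field(default_factory=list)
    agenda: List[str] = field(default_factory=list)               # textual side information
    discussion_papers: List[str] = field(default_factory=list)
    slide_text: List[str] = field(default_factory=list)
    interaction_events: List[dict] = field(default_factory=list)  # e.g. pointer or mouse events


# Hypothetical example: one close-talking channel, one distant channel, two cameras.
meeting = MeetingRecording(
    meeting_id="demo-001",
    audio=[AudioStream("close-talking", "audio/spk1.wav"),
           AudioStream("distant", "audio/array_ch1.wav")],
    video=[VideoStream("cam-left", "video/left.avi"),
           VideoStream("cam-centre", "video/centre.avi")],
    agenda=["Project status", "Budget review", "Next steps"],
)
```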

Objectives

  1. Development of a "smart" meeting room, collection and annotation of a multimodal meetings database.
  2. Analysis and processing of the audio and video streams:
    • Robust conversational speech recognition, to produce a word-level transcription;
    • Recognition of gestures and actions;
    • Multimodal identification of intent and emotion;
    • Multimodal person identification;
    • Source localization and tracking.
    Although the technologies addressed here are not yet fully mature, they are sufficiently well established to warrant their use in combination. The integration of multiple sensory inputs is a challenging problem that is still at an early stage of investigation.
  3. Integration and structuring using the output of the various recognizers and analyses:
    • Specification of a flexible intelligent information management framework;
    • Models for the integration of multimodal streams, including statistical models for asynchronous multiple streams, multimodal syntax and multisource decoding (a minimal fusion sketch follows this list);
    • Summarization of a meeting, or a meeting segment; this could take various forms such as a textual precis or a set of video key frames;
    • Multimodal information extraction and cross-lingual retrieval/browsing across the archive.
  4. Construction of a demonstrator system for browsing and accessing information from an archive of processed meetings.
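To make the idea of combining recognizer outputs (objective 3) concrete, here is a minimal late-fusion sketch: per-modality scores for a person-identification decision are merged with a weighted log-linear combination. The function, weights, labels, and score values are invented for illustration; the project's planned integration models, such as statistical models for asynchronous multiple streams, go well beyond this.

```python
import math
from typing import Dict


def fuse_scores(modality_scores: Dict[str, Dict[str, float]],
                weights: Dict[str, float]) -> str:
    """Combine per-modality posterior scores for candidate identities with a
    weighted log-linear fusion and return the highest-scoring candidate.

    modality_scores maps modality -> {candidate: probability}.
    weights maps modality -> fusion weight (higher = more trusted).
    """
    candidates = next(iter(modality_scores.values())).keys()
    fused = {}
    for cand in candidates:
        fused[cand] = sum(
            weights[mod] * math.log(scores[cand] + 1e-12)
            for mod, scores in modality_scores.items()
        )
    return max(fused, key=fused.get)


# Hypothetical scores from a speaker-ID system (audio) and a face-ID
# system (video) for three meeting participants.
scores = {
    "audio": {"alice": 0.7, "bob": 0.2, "carol": 0.1},
    "video": {"alice": 0.4, "bob": 0.5, "carol": 0.1},
}
print(fuse_scores(scores, weights={"audio": 0.6, "video": 0.4}))  # -> "alice"
```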

We can illustrate these objectives by outlining a proposed meeting browser. Consider a meeting (of 3-15 people) that follows an agenda. Offline, the meeting is segmented into agenda items, and the browser follows the structure of the agenda. Clicking on an agenda item brings up a set of ways to view that topic. These might include: a textual summary; a diagrammatic discussion flow indicating which participants were involved; a set of audio or video key frames to give the essence of the discussion. A more advanced browser might be able to identify what actions were agreed as the outcome of the discussion on that topic. It would also be possible to query the rest of the archive, either by example from that segment or through an explicit query.
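A minimal sketch of the material such a browser might hold for one agenda item could look as follows; the class, field, and function names are invented for this example, and the helper simply mirrors the viewing options listed above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AgendaSegment:
    """One agenda item and the derived material the browser can display."""
    title: str
    start_time_s: float
    end_time_s: float
    summary: str = ""
    participants: List[str] = field(default_factory=list)
    key_frames: List[str] = field(default_factory=list)      # paths to extracted video frames
    agreed_actions: List[str] = field(default_factory=list)


def views_for(segment: AgendaSegment) -> List[str]:
    """List the viewing options the browser would offer for a segment."""
    views = []
    if segment.summary:
        views.append("textual summary")
    if segment.participants:
        views.append("discussion flow diagram")
    if segment.key_frames:
        views.append("video key frames")
    if segment.agreed_actions:
        views.append("agreed actions")
    return views


# Hypothetical segment produced by offline analysis of one agenda item.
budget = AgendaSegment(
    title="Budget review",
    start_time_s=900.0,
    end_time_s=1740.0,
    summary="The group agreed to reallocate travel funds to equipment.",
    participants=["alice", "bob"],
    key_frames=["frames/budget_01.jpg", "frames/budget_02.jpg"],
)
print(views_for(budget))
```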

The existence of the textual "side information" enables the application of some useful constraints. As a minimum there will be a prespecified agenda for the meeting that provides a basic structure in terms of a topic sequence. There may also be discussion papers or the text of slides associated with an item, which may be used to adapt the language model of the speech recognizer, or to act as query expansion information for the retrieval system. In the case where slides or other visual aids are used, the interaction stream (typically containing "pointing" information) may be used as a further constraint, giving information about a sub-topic sequence, for example.
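As a simplified illustration of how side information could adapt the recognizer's language model, the sketch below interpolates a background unigram model with one estimated from slide text. Real systems would adapt full n-gram models; all names, words, and probabilities here are invented for the example.

```python
from collections import Counter
from typing import Dict, Iterable


def unigram_lm(tokens: Iterable[str]) -> Dict[str, float]:
    """Maximum-likelihood unigram probabilities from a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def adapt_lm(background: Dict[str, float],
             side_text: Dict[str, float],
             lam: float = 0.8) -> Dict[str, float]:
    """Linear interpolation of a background LM with a small LM estimated
    from meeting side information (agenda, discussion papers, slides)."""
    vocab = set(background) | set(side_text)
    return {w: lam * background.get(w, 0.0) + (1 - lam) * side_text.get(w, 0.0)
            for w in vocab}


# Hypothetical background unigram LM and slide text for one agenda item.
background = {"the": 0.05, "budget": 0.0004, "allocation": 0.0001, "meeting": 0.001}
slides = unigram_lm("budget allocation for equipment and travel budget".split())
adapted = adapt_lm(background, slides, lam=0.8)
print(adapted["budget"])   # boosted relative to the background estimate
```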