D1.1: Specification of Smart Room Environment and Data Collection and Annotation Protocols



M4 will perform multimodal data collection using a smart room constructed at IDIAP. This deliverable describes the specification of the room environment and the data collection and annotation protocols that we will use.

The smart room has been fully specified and installed at IDIAP. The current installation includes 24 microphones, 3 fixed cameras, and synchronization and acquisition equipment. The setup has been fully debugged, and meeting recordings are proceeding.

This document contains the specification of the equipment used, a discussion of the configurations used for the recording of meetings, and details of the annotation protocol to be used.

Smart Room Equipment

Audio Acquisition

Video Acquisition

Recordings are digital, at full PAL resolution and frame rate.



The system has been fully installed at IDIAP. Pictures of the smart room are included in the appendix.

Recording of Meetings


For audio-visual recordings, the installation is dimensioned for a room with six persons. Two cameras are placed to give front-on views of three people each. The third camera looks down the table towards the front of the room (whiteboard, projector screen, whole meeting). Between 12 and 16 microphones are arranged in table-top arrays.

For audio-only recordings, up to 12 meeting participants can be recorded using lapel microphones only, or up to 8 participants using lapel microphones together with table-top array microphones.
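
The capacity limits above can be written down as a small configuration check. The following is a minimal Python sketch, not the project's software: the class name `RecordingConfig`, its fields, and the `validate` function are hypothetical. It also assumes that each lapel or array microphone occupies one of the 24 installed microphones, and that every participant in an audio-only recording wears a lapel microphone; neither assumption is stated explicitly in this specification.

```python
from dataclasses import dataclass

# Limits taken from the text of this deliverable; all names are hypothetical.
MAX_AV_PARTICIPANTS = 6        # audio-visual: room dimensioned for six persons
ARRAY_MICS_RANGE = (12, 16)    # audio-visual: table-top array microphones
MAX_LAPEL_ONLY = 12            # audio-only, lapel microphones only
MAX_LAPEL_WITH_ARRAY = 8       # audio-only, lapel plus table-top arrays
TOTAL_MICROPHONES = 24         # assumption: one channel per installed microphone


@dataclass(frozen=True)
class RecordingConfig:
    participants: int
    video: bool
    lapel_mics: int
    array_mics: int


def validate(cfg):
    """Return a list of problems with the configuration (empty if none)."""
    problems = []
    if cfg.participants < 1:
        problems.append("at least one participant is required")
    if cfg.lapel_mics + cfg.array_mics > TOTAL_MICROPHONES:
        problems.append("more microphones requested than the 24 installed")
    if cfg.video:
        if cfg.participants > MAX_AV_PARTICIPANTS:
            problems.append("audio-visual recordings allow at most 6 persons")
        lo, hi = ARRAY_MICS_RANGE
        if not lo <= cfg.array_mics <= hi:
            problems.append("audio-visual recordings use 12 to 16 array microphones")
    else:
        limit = MAX_LAPEL_WITH_ARRAY if cfg.array_mics > 0 else MAX_LAPEL_ONLY
        if cfg.participants > limit:
            problems.append(f"audio-only recordings allow at most {limit} participants here")
        # Assumption: one lapel microphone per audio-only participant.
        if cfg.lapel_mics < cfg.participants:
            problems.append("each participant needs a lapel microphone")
    return problems
```

For example, `validate(RecordingConfig(participants=9, video=False, lapel_mics=9, array_mics=12))` reports a single problem, because at most 8 participants can be recorded when table-top arrays are used alongside lapel microphones.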

Output Media

Two output formats are available:

System Expansion

The meeting recording configuration (number of channels, camera placement, microphone arrays, etc.) has been tailored to the needs of IDIAP researchers. Nevertheless, the system is scalable, and it is easy to add more video and audio channels (for example, a binaural manikin or one camera per person).

It is also possible to integrate specific hardware from other partners, and to support the collection and management of additional data.


In M4 we propose to collect 50-100 hours of meeting data, some audio-video, some audio-only. It would be a major undertaking to attempt a detailed transcription of such a database. Instead we propose to transcribe a small number of hours for research purposes (principally evaluation).

M4 subcontractor ICSI has transcribed about 90 hours of meeting audio, and this data is available to the project. Detailed audio transcription carried out in M4 will use the transcription protocols developed by ICSI for the transcription of the ICSI corpus of meeting audio.

One of the main questions is how to represent the output of visual processing for retrieval and indexing purposes. Since we have no annotation tools of our own, we propose to use (and, if necessary, develop) public video annotation tools such as Anvil.

The main annotation guideline for M4 will be to annotate according to information access needs (WP3): possible events to be annotated include emotion, decision points, and gestures. To this end, an initial set of meetings has been recorded, and this data will be distributed among all the partners for event annotation (and testing of recognizers), to form the focus of the project meeting in January 2003.
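
An event annotation of this kind can be pictured as a labelled time interval attached to a recording. The sketch below is a hypothetical illustration in Python, not the M4 annotation format: the `EventAnnotation` fields, the event type names (taken from the example events named above), and the `events_in_window` helper are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Event types named as examples in the text; the identifiers are hypothetical.
EVENT_TYPES = {"emotion", "decision_point", "gesture"}


@dataclass(frozen=True)
class EventAnnotation:
    meeting_id: str      # hypothetical identifier of the recorded meeting
    start_s: float       # start time in seconds from the start of the recording
    end_s: float         # end time in seconds
    event_type: str      # one of EVENT_TYPES
    label: str = ""      # free-text detail, e.g. the kind of gesture
    annotator: str = ""  # who produced the annotation

    def __post_init__(self):
        if self.event_type not in EVENT_TYPES:
            raise ValueError(f"unknown event type: {self.event_type!r}")
        if self.end_s < self.start_s:
            raise ValueError("end_s must not precede start_s")


def events_in_window(annotations, start_s, end_s, event_type=None):
    """Return annotations overlapping [start_s, end_s], sorted by start time."""
    hits = [
        a for a in annotations
        if a.start_s <= end_s and a.end_s >= start_s
        and (event_type is None or a.event_type == event_type)
    ]
    return sorted(hits, key=lambda a: a.start_s)
```

A query such as `events_in_window(annotations, 9.0, 15.0, "gesture")` would then return the gestures annotated in that part of a meeting, which is the sort of lookup an information access application might need.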

MPEG-7 is becoming a major component of the digital media scene, and we plan to ensure that M4 data and annotations are organized in a way that is compliant with MPEG-7, whenever practical.

Appendix: Pictures of M4 Smart Room

Fig 1: Overview of Smart Room
Fig 2: Overview of Smart Room
Fig 3: Tabletop microphone array
Fig 4: Acquisition and synchronization
Fig 5: Video Walkmen
Fig 6: Fixed CCTV camera
Fig 7: Mobile miniature CCTV camera
Fig 8: Still from camera 1 during meeting
Fig 9: Still from camera 2
Fig 10: Still from camera 3

Steve Renals
Last modified: Mon Oct 14 17:36:01 BST 2002