Workshop Programme

The workshop will be a full-day event from 9:00 (TBC) to 18:00, followed by an on-site evening social. The programme will be single-track, consisting of a mix of poster sessions, oral sessions, and two invited keynote talks.

Keynote 1: Florian Metze, Carnegie Mellon University

Open-domain audiovisual speech recognition and video summarization


Video understanding is one of the hardest challenges in AI. If a machine can look at videos and “understand” the events that are being shown, then machines could learn by themselves, perhaps even without supervision, simply by “watching” broadcast TV, Facebook, YouTube, or similar sites. Making progress towards this goal requires contributions from experts in diverse fields, including computer vision, automatic speech recognition, machine translation, natural language processing, multimodal information processing, and multimedia. I will report the outcomes of the JSALT 2018 Workshop on this topic, including advances in multitask learning for joint audiovisual captioning, summarization, and translation, as well as auxiliary tasks such as text-only translation, language modeling, story segmentation, and classification. I will demonstrate a few results on the “How-to” dataset of instructional videos harvested from the web by my team at Carnegie Mellon University, and discuss remaining challenges and possible additional datasets for this research.

Keynote 2: John HL Hansen, University of Texas at Dallas

Robust speaker diarization and recognition in naturalistic data streams: Challenges for multi-speaker tasks & learning spaces


Speech technology is advancing beyond general speech recognition for voice command and telephone applications. Today, the emergence of many voice-enabled speech systems has created the need for more effective distance-based voice capture and automatic speech and speaker recognition. The ability to employ speech and language technology to assess human-to-human interactions is opening up new research paradigms which can have a profound impact on assessing human interaction, including personal communication traits, and contribute to improving the quality of life and educational experience of individuals. In this talk, we will explore recent research trends in automatic audio diarization and speaker recognition for audio streams which include multiple tracks, speakers, and environments with distance-based speech capture. Specifically, we will consider (i) the Prof-Life-Log corpus, (ii) education-based child and student Peer-Led Team Learning, and (iii) Apollo-11 massive multi-track audio processing (19,000 hours of data). These domains will be discussed in the context of the CHiME workshops, in terms of algorithmic advancements as well as directions for continued research.

CHARMINAR, Hyderabad