Workshop Programme

The workshop will be a full-day event with a mix of poster and oral sessions, and will feature two invited keynote speakers.

Keynote 1: Florian Metze, Carnegie Mellon University

Open-domain audiovisual speech recognition and video summarization


Video understanding is one of the hardest challenges in AI. If a machine could look at videos and “understand” the events being shown, it could learn by itself, perhaps even without supervision, simply by “watching” broadcast TV, Facebook, YouTube, or similar sites. Making progress towards this goal requires contributions from experts in diverse fields, including computer vision, automatic speech recognition, machine translation, natural language processing, multimodal information processing, and multimedia. I will report the outcomes of the JSALT 2018 Workshop on this topic, including advances in multitask learning for joint audiovisual captioning, summarization, and translation, as well as auxiliary tasks such as text-only translation, language modeling, story segmentation, and classification. I will demonstrate a few results on the “How-to” dataset of instructional videos harvested from the web by my team at Carnegie Mellon University, and discuss the remaining challenges and other possible datasets for this research.

Keynote 2: John HL Hansen, University of Texas at Dallas

Robust speaker diarization and recognition


