Multimodal Integration in Audio-Visual Speech Recognition --- How Far Are We from Human-Level Robustness?
Abstract
This paper introduces a novel evaluation framework, inspired by methods from human psychophysics, to systematically assess the robustness of multimodal integration in audio-visual speech recognition (AVSR) models relative to human abilities. We present preliminary results on AV-HuBERT suggesting that multimodal integration in state-of-the-art (SOTA) AVSR models remains mediocre compared to human performance, and we discuss avenues for improvement.
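The abstract describes a psychophysics-inspired evaluation of AVSR robustness. The sketch below illustrates what such a protocol can look like in code; it is not the authors' framework. It degrades the audio at controlled signal-to-noise ratios and compares word error rates with and without the visual stream. The `transcribe(audio, video)` interface, the white-noise degradation, and the SNR grid are assumptions made for illustration only.

```python
# Illustrative psychophysics-style robustness sweep for an AVSR model.
# NOT the paper's actual framework; `transcribe` is a hypothetical stand-in
# for a model such as AV-HuBERT (pass video=None for the audio-only condition).
import numpy as np


def add_noise(audio: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix white noise into the waveform at a target signal-to-noise ratio."""
    signal_power = np.mean(audio ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = np.random.normal(0.0, np.sqrt(noise_power), size=audio.shape)
    return audio + noise


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance normalized by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)
    d[0, :] = np.arange(len(hyp) + 1)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,      # deletion
                          d[i, j - 1] + 1,      # insertion
                          d[i - 1, j - 1] + cost)  # substitution
    return d[len(ref), len(hyp)] / max(len(ref), 1)


def evaluate_robustness(transcribe, dataset, snr_levels=(-10, -5, 0, 5, 10, 20)):
    """Return mean WER per SNR level for audio-only vs. audio-visual input.

    `dataset` is assumed to yield (audio, video, reference_text) triples.
    The resulting WER-vs-SNR curves can be compared against human
    psychometric curves collected under the same noise conditions.
    """
    curves = {"audio_only": [], "audio_visual": []}
    for snr in snr_levels:
        errors = {"audio_only": [], "audio_visual": []}
        for audio, video, text in dataset:
            noisy = add_noise(audio, snr)
            errors["audio_only"].append(word_error_rate(text, transcribe(noisy, None)))
            errors["audio_visual"].append(word_error_rate(text, transcribe(noisy, video)))
        for condition in curves:
            curves[condition].append(float(np.mean(errors[condition])))
    return dict(snr_levels=list(snr_levels), **curves)
```

Under these assumptions, the gap between the audio-only and audio-visual curves at low SNR is one way to quantify how much the model actually benefits from the visual stream relative to human listeners.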
Cite

Text

Schweitzer et al. "Multimodal Integration in Audio-Visual Speech Recognition --- How Far Are We from Human-Level Robustness?." NeurIPS 2024 Workshops: Behavioral_ML, 2024.

Markdown

[Schweitzer et al. "Multimodal Integration in Audio-Visual Speech Recognition --- How Far Are We from Human-Level Robustness?." NeurIPS 2024 Workshops: Behavioral_ML, 2024.](https://mlanthology.org/neuripsw/2024/schweitzer2024neuripsw-multimodal/)

BibTeX
@inproceedings{schweitzer2024neuripsw-multimodal,
title = {{Multimodal Integration in Audio-Visual Speech Recognition --- How Far Are We from Human-Level Robustness?}},
author = {Schweitzer, Marianne and Montagnini, Anna and Fourtassi, Abdellah and Schatz, Thomas},
booktitle = {NeurIPS 2024 Workshops: Behavioral_ML},
year = {2024},
url = {https://mlanthology.org/neuripsw/2024/schweitzer2024neuripsw-multimodal/}
}