Learning to Answer Questions in Dynamic Audio-Visual Scenarios

Abstract

In this paper, we focus on the Audio-Visual Question Answering (AVQA) task, which aims to answer questions regarding different visual objects, sounds, and their associations in videos. The problem requires comprehensive multimodal understanding and spatio-temporal reasoning over audio-visual scenes. To benchmark this task and facilitate our study, we introduce MUSIC-AVQA, a large-scale dataset containing more than 45K question-answer pairs covering 33 question templates that span different modalities and question types. We develop several baselines and introduce a spatio-temporal grounded audio-visual network for the AVQA problem. Our results demonstrate that AVQA benefits from multisensory perception and that our model outperforms recent A-, V-, and AVQA approaches. We believe the dataset can serve as a testbed for evaluating and promoting progress in audio-visual scene understanding and spatio-temporal reasoning. Code and dataset: http://gewu-lab.github.io/MUSIC-AVQA/
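For readers exploring the dataset, the sketch below shows one way to tally question-answer pairs by question type after downloading the annotations. It is a minimal illustration only: the annotation file name ("avqa_train.json") and the field name ("type") are assumptions, not the released schema.

```python
# Minimal, hypothetical sketch (not the authors' released loader) for
# summarizing MUSIC-AVQA question-answer annotations by question type.
# Assumed: the annotation file is a JSON list of QA-pair dicts, each with
# a "type" field describing the question type. Adjust to the real schema.
import json
from collections import Counter

def summarize_qa_pairs(annotation_path: str) -> Counter:
    """Count QA pairs per (assumed) question-type field."""
    with open(annotation_path, "r", encoding="utf-8") as f:
        qa_pairs = json.load(f)  # assumed: a list of QA-pair dicts
    return Counter(item.get("type", "unknown") for item in qa_pairs)

if __name__ == "__main__":
    counts = summarize_qa_pairs("avqa_train.json")  # hypothetical path
    for question_type, n in counts.most_common():
        print(f"{question_type}: {n} QA pairs")
```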

Cite

Text

Li et al. "Learning to Answer Questions in Dynamic Audio-Visual Scenarios." Conference on Computer Vision and Pattern Recognition, 2022. doi:10.1109/CVPR52688.2022.01852

Markdown

[Li et al. "Learning to Answer Questions in Dynamic Audio-Visual Scenarios." Conference on Computer Vision and Pattern Recognition, 2022.](https://mlanthology.org/cvpr/2022/li2022cvpr-learning-d/) doi:10.1109/CVPR52688.2022.01852

BibTeX

@inproceedings{li2022cvpr-learning-d,
  title     = {{Learning to Answer Questions in Dynamic Audio-Visual Scenarios}},
  author    = {Li, Guangyao and Wei, Yake and Tian, Yapeng and Xu, Chenliang and Wen, Ji-Rong and Hu, Di},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2022},
  pages     = {19108--19118},
  doi       = {10.1109/CVPR52688.2022.01852},
  url       = {https://mlanthology.org/cvpr/2022/li2022cvpr-learning-d/}
}