Learning to Answer Questions in Dynamic Audio-Visual Scenarios
Abstract
In this paper, we focus on the Audio-Visual Question Answering (AVQA) task, which aims to answer questions regarding different visual objects, sounds, and their associations in videos. The problem requires comprehensive multimodal understanding and spatio-temporal reasoning over audio-visual scenes. To benchmark this task and facilitate our study, we introduce the large-scale MUSIC-AVQA dataset, which contains more than 45K question-answer pairs covering 33 different question templates spanning different modalities and question types. We develop several baselines and introduce a spatio-temporal grounded audio-visual network for the AVQA problem. Our results demonstrate that AVQA benefits from multisensory perception and that our model outperforms recent A-, V-, and AVQA approaches. We believe that our dataset can serve as a testbed for evaluating and promoting progress in audio-visual scene understanding and spatio-temporal reasoning. Code and dataset: http://gewu-lab.github.io/MUSIC-AVQA/
Cite
Text
Li et al. "Learning to Answer Questions in Dynamic Audio-Visual Scenarios." Conference on Computer Vision and Pattern Recognition, 2022. doi:10.1109/CVPR52688.2022.01852

Markdown
[Li et al. "Learning to Answer Questions in Dynamic Audio-Visual Scenarios." Conference on Computer Vision and Pattern Recognition, 2022.](https://mlanthology.org/cvpr/2022/li2022cvpr-learning-d/) doi:10.1109/CVPR52688.2022.01852

BibTeX
@inproceedings{li2022cvpr-learning-d,
title = {{Learning to Answer Questions in Dynamic Audio-Visual Scenarios}},
author = {Li, Guangyao and Wei, Yake and Tian, Yapeng and Xu, Chenliang and Wen, Ji-Rong and Hu, Di},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2022},
pages = {19108--19118},
doi = {10.1109/CVPR52688.2022.01852},
url = {https://mlanthology.org/cvpr/2022/li2022cvpr-learning-d/}
}