Pano-AVQA: Grounded Audio-Visual Question Answering on 360° Videos
Abstract
360° videos convey holistic views of the surroundings of a scene. They provide audio-visual cues beyond predetermined normal fields of view and display distinctive spatial relations on a sphere. However, previous benchmark tasks for panoramic videos remain limited in evaluating semantic understanding of audio-visual relationships or spherical spatial properties of the surroundings. We propose a novel benchmark named Pano-AVQA, a large-scale grounded audio-visual question answering dataset on panoramic videos. Using 5.4K 360° video clips harvested online, we collect two types of novel question-answer pairs with bounding-box grounding: spherical spatial relation QAs and audio-visual relation QAs. We train several transformer-based models on Pano-AVQA, where the results suggest that our proposed spherical spatial embeddings and multimodal training objectives contribute to better semantic understanding of the panoramic surroundings on the dataset.
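The abstract mentions spherical spatial embeddings for panoramic input but does not spell out their form. Below is a minimal sketch, assuming (not confirmed by this page) that each object's equirectangular bounding box is mapped to angular coordinates on the sphere and linearly projected into the transformer's hidden space; the class name, feature choice, and hidden dimension are all illustrative.

```python
# Hypothetical sketch of a spherical spatial embedding; not the paper's
# exact formulation. Assumes boxes are normalized equirectangular
# (cx, cy, w, h) coordinates in [0, 1].
import math
import torch
import torch.nn as nn


class SphericalSpatialEmbedding(nn.Module):
    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        # 6 features: sin/cos of longitude, sin/cos of latitude,
        # plus angular width and height of the box.
        self.proj = nn.Linear(6, hidden_dim)

    def forward(self, boxes: torch.Tensor) -> torch.Tensor:
        cx, cy, w, h = boxes.unbind(-1)
        theta = (cx - 0.5) * 2 * math.pi  # longitude in [-pi, pi]
        phi = (cy - 0.5) * math.pi        # latitude in [-pi/2, pi/2]
        feats = torch.stack(
            [theta.sin(), theta.cos(), phi.sin(), phi.cos(),
             w * 2 * math.pi, h * math.pi],
            dim=-1,
        )
        return self.proj(feats)


# Usage: embed two detected regions for a transformer with 768-dim tokens.
emb = SphericalSpatialEmbedding(hidden_dim=768)
boxes = torch.tensor([[0.25, 0.5, 0.10, 0.20],
                      [0.90, 0.4, 0.05, 0.10]])
print(emb(boxes).shape)  # torch.Size([2, 768])
```

Using sines and cosines of the angles (rather than raw pixel coordinates) respects the wrap-around of the sphere's longitude, which is one plausible reason such an embedding would help on panoramic video.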
Cite
Text
Yun et al. "Pano-AVQA: Grounded Audio-Visual Question Answering on 360° Videos." International Conference on Computer Vision, 2021.
Markdown
[Yun et al. "Pano-AVQA: Grounded Audio-Visual Question Answering on 360° Videos." International Conference on Computer Vision, 2021.](https://mlanthology.org/iccv/2021/yun2021iccv-panoavqa/)
BibTeX
@inproceedings{yun2021iccv-panoavqa,
title = {{Pano-AVQA: Grounded Audio-Visual Question Answering on 360{\textdegree} Videos}},
author = {Yun, Heeseung and Yu, Youngjae and Yang, Wonsuk and Lee, Kangil and Kim, Gunhee},
booktitle = {International Conference on Computer Vision},
year = {2021},
pages = {2031--2041},
url = {https://mlanthology.org/iccv/2021/yun2021iccv-panoavqa/}
}