Visual Question Answering on 360° Images
Abstract
In this work, we introduce VQA 360°, a novel task of visual question answering on 360° images. Unlike a normal field-of-view image, a 360° image captures the entire visual content around the optical center of a camera, demanding more sophisticated spatial understanding and reasoning. To address this problem, we collect the first VQA 360° dataset, containing around 17,000 real-world image-question-answer triplets for a variety of question types. We then study two different VQA models on VQA 360°: a conventional model that takes an equirectangular image (with intrinsic distortion) as input, and a dedicated model that first projects the 360° image onto cubemaps and then aggregates information from multiple spatial resolutions. We demonstrate that the cubemap-based model with multi-level fusion and attention diffusion performs favorably against other variants and the equirectangular-based models. Nevertheless, the gap between human and machine performance reveals the need for more advanced VQA 360° algorithms. We therefore expect our dataset and studies to serve as a benchmark for future development of this challenging task. The dataset, code, and pre-trained models are available online.
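The cubemap projection underlying the dedicated model can be sketched directly: each of the six cube faces is rendered by casting a ray through every face pixel and sampling the equirectangular image at the ray's longitude and latitude. Below is a minimal illustrative sketch in NumPy; the function name, the face-axis conventions, and the nearest-neighbor sampling are assumptions for illustration and may differ from the paper's actual projection pipeline.

```python
import numpy as np

def equirect_to_cubemap_face(equi, face, face_size=256):
    """Sample one cubemap face from an equirectangular image.

    equi: H x W x 3 array; face: 'front'|'back'|'left'|'right'|'up'|'down'.
    Uses nearest-neighbor sampling for brevity (assumption; interpolation
    schemes vary across implementations).
    """
    H, W = equi.shape[:2]
    # Pixel grid on the face plane, in [-1, 1].
    u, v = np.meshgrid(np.linspace(-1, 1, face_size),
                       np.linspace(-1, 1, face_size))
    one = np.ones_like(u)
    # 3D ray direction for each pixel; axis conventions are one common
    # choice, not necessarily the paper's.
    x, y, z = {
        'front': ( u,  -v,   one),
        'back':  (-u,  -v,  -one),
        'right': ( one, -v,  -u),
        'left':  (-one, -v,   u),
        'up':    ( u,   one,  v),
        'down':  ( u,  -one, -v),
    }[face]
    # Ray direction -> spherical coordinates (longitude, latitude).
    lon = np.arctan2(x, z)                      # in [-pi, pi]
    lat = np.arctan2(y, np.sqrt(x**2 + z**2))   # in [-pi/2, pi/2]
    # Spherical coordinates -> equirectangular pixel indices.
    j = ((lon / np.pi + 1) / 2 * (W - 1)).astype(int)
    i = ((0.5 - lat / np.pi) * (H - 1)).astype(int)
    return equi[i, j]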
Cite
Text
Chou et al. "Visual Question Answering on 360° Images." Winter Conference on Applications of Computer Vision, 2020.
Markdown
[Chou et al. "Visual Question Answering on 360° Images." Winter Conference on Applications of Computer Vision, 2020.](https://mlanthology.org/wacv/2020/chou2020wacv-visual/)
BibTeX
@inproceedings{chou2020wacv-visual,
title = {{Visual Question Answering on 360° Images}},
author = {Chou, Shih-Han and Chao, Wei-Lun and Lai, Wei-Sheng and Sun, Min and Yang, Ming-Hsuan},
booktitle = {Winter Conference on Applications of Computer Vision},
year = {2020},
url = {https://mlanthology.org/wacv/2020/chou2020wacv-visual/}
}