Fully Authentic Visual Question Answering Dataset from Online Communities

Abstract

Visual Question Answering (VQA) entails answering questions about images. We introduce the first VQA dataset in which all contents originate from an authentic use case. Because it is sourced from online question answering community forums, we call it VQAonline. We characterize this dataset and how it relates to eight mainstream VQA datasets. Observing that answers in our dataset tend to be much longer (i.e., a mean of 173 words) and so are incompatible with standard VQA evaluation metrics, we instead use popular metrics for longer-text evaluation to evaluate six state-of-the-art VQA models on VQAonline and report where they struggle most. Finally, we analyze which evaluation metrics align best with human judgments. We publicly share the dataset at https://vqaonline.github.io/.
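Because VQAonline answers average 173 words, short-answer VQA metrics such as exact-match accuracy are a poor fit, and the paper turns to longer-text metrics instead. As a minimal sketch of what such scoring looks like, the snippet below computes ROUGE-L (one popular longer-text metric) with the `rouge-score` package; the example texts and metric configuration are illustrative assumptions, not the paper's exact evaluation setup.

```python
# Minimal sketch: scoring a long-form VQA answer with ROUGE-L.
# Assumption: ROUGE-L stands in for the longer-text metrics studied in the
# paper; install the dependency with `pip install rouge-score`.
from rouge_score import rouge_scorer

# Hypothetical example pair: a long-form community reference answer and a
# shorter model-generated answer.
reference = (
    "The plant in your photo is a jade plant (Crassula ovata). The wrinkled "
    "leaves usually indicate underwatering; water it thoroughly once the "
    "soil has dried out completely."
)
prediction = "It looks like a jade plant that needs more water."

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
scores = scorer.score(reference, prediction)  # signature: score(target, prediction)
print(f"ROUGE-L F1: {scores['rougeL'].fmeasure:.3f}")
```

The F-measure balances precision (how much of the prediction is supported by the reference) against recall (how much of the reference the prediction covers), which matters when references are far longer than predictions, as is typical in this dataset.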

Cite

Text

Chen et al. "Fully Authentic Visual Question Answering Dataset from Online Communities." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-73195-2_15

Markdown

[Chen et al. "Fully Authentic Visual Question Answering Dataset from Online Communities." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/chen2024eccv-fully/) doi:10.1007/978-3-031-73195-2_15

BibTeX

@inproceedings{chen2024eccv-fully,
  title     = {{Fully Authentic Visual Question Answering Dataset from Online Communities}},
  author    = {Chen, Chongyan and Liu, Mengchen and Codella, Noel C. and Li, Yunsheng and Yuan, Lu and Gurari, Danna},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-73195-2_15},
  url       = {https://mlanthology.org/eccv/2024/chen2024eccv-fully/}
}