Answer Mining from a Pool of Images: Towards Retrieval-Based Visual Question Answering

Penamakuri, Abhirama Subramanyam; Gupta, Manish; Das Gupta, Mithun; Mishra, Anand

doi:10.24963/IJCAI.2023/146

IJCAI 2023 pp. 1312-1321

Abstract

We study visual question answering in a setting where the answer must be mined from a pool of relevant and irrelevant images given as context. In such a setting, a model must first retrieve the relevant images from the pool and then answer the question from these retrieved images. We refer to this problem as retrieval-based visual question answering (RETVQA for short). RETVQA is distinctly different from, and more challenging than, traditionally studied Visual Question Answering (VQA), where a given question is answered with a single relevant image as context. Towards solving the RETVQA task, we propose a unified Multi Image BART (MI-BART) that takes a question and the images retrieved by our relevance encoder and generates free-form, fluent answers. Further, we introduce the largest dataset in this space, also named RETVQA, which has the following salient features: a multi-image and retrieval requirement for VQA, metadata-independent questions over a pool of heterogeneous images, and an expected mix of classification-oriented and open-ended generative answers. Our proposed framework achieves an accuracy of 76.5% and a fluency of 79.3% on the proposed RETVQA dataset, and also outperforms state-of-the-art methods by 4.9% and 11.8% on the accuracy and fluency metrics, respectively, on the image segment of the publicly available WebQA dataset.
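
As a rough illustration of the retrieve-then-answer flow the abstract describes, the Python sketch below ranks a pool of images by relevance to a question and passes the top-ranked ones to an answer generator. It is not the authors' MI-BART or relevance encoder: cosine similarity over toy feature vectors stands in for a learned relevance model, and `PoolImage`, `retrieve`, `answer_from_pool`, and the `generate` callback are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Callable, List, Sequence


@dataclass
class PoolImage:
    """One image in the context pool (hypothetical container)."""
    image_id: str
    features: Sequence[float]  # assumed precomputed embedding, not from the paper


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; a stand-in for a learned relevance score."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(question_vec: Sequence[float],
             pool: List[PoolImage],
             top_k: int = 2) -> List[PoolImage]:
    """Stage 1: keep the top_k pool images most relevant to the question."""
    ranked = sorted(pool,
                    key=lambda img: cosine(question_vec, img.features),
                    reverse=True)
    return ranked[:top_k]


def answer_from_pool(question: str,
                     question_vec: Sequence[float],
                     pool: List[PoolImage],
                     generate: Callable[[str, List[PoolImage]], str],
                     top_k: int = 2) -> str:
    """Stage 2: answer the question from the retrieved images only."""
    relevant = retrieve(question_vec, pool, top_k)
    return generate(question, relevant)
```

In a real system, `generate` would wrap a multi-image answer generator and the feature vectors would come from a trained encoder; here both are left to the caller so that the two-stage structure stays visible.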

Cite

Text

Penamakuri et al. "Answer Mining from a Pool of Images: Towards Retrieval-Based Visual Question Answering." International Joint Conference on Artificial Intelligence, 2023. doi:10.24963/IJCAI.2023/146

Markdown

[Penamakuri et al. "Answer Mining from a Pool of Images: Towards Retrieval-Based Visual Question Answering." International Joint Conference on Artificial Intelligence, 2023.](https://mlanthology.org/ijcai/2023/penamakuri2023ijcai-answer/) doi:10.24963/IJCAI.2023/146

BibTeX

@inproceedings{penamakuri2023ijcai-answer,
  title     = {{Answer Mining from a Pool of Images: Towards Retrieval-Based Visual Question Answering}},
  author    = {Penamakuri, Abhirama Subramanyam and Gupta, Manish and Das Gupta, Mithun and Mishra, Anand},
  booktitle = {International Joint Conference on Artificial Intelligence},
  year      = {2023},
  pages     = {1312--1321},
  doi       = {10.24963/IJCAI.2023/146},
  url       = {https://mlanthology.org/ijcai/2023/penamakuri2023ijcai-answer/}
}