Maintaining Reasoning Consistency in Compositional Visual Question Answering
Abstract
A compositional question contains multiple visual concepts (e.g., objects, attributes, and relationships) and requires compositional reasoning to answer. Existing VQA models can answer a compositional question well, but struggle to maintain reasoning consistency between their answer to the compositional question and their answers to its sub-questions. For example, given the compositional question "Are there any elephants to the right of the white bird?" and its sub-question "Is any bird visible in the scene?", a model may answer "yes" to the compositional question but "no" to the sub-question. This paper presents a dialog-like reasoning method for maintaining reasoning consistency when answering a compositional question and its sub-questions. Our method integrates the reasoning processes for the sub-questions into the reasoning process for the compositional question, as in a dialog task, and uses a consistency constraint to penalize inconsistent answer predictions. To enable quantitative evaluation of reasoning consistency, we construct the GQA-Sub dataset based on the well-organized GQA dataset. Experimental results on the GQA and GQA-Sub datasets demonstrate the effectiveness of our method.
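The inconsistency described in the abstract can be made concrete with a small check. The following is a minimal sketch only, not the paper's consistency constraint or its GQA-Sub evaluation metric: the function name `consistency_rate`, the `entailments` tuple layout, and the rule "if the model gives this answer to the compositional question, it should give that answer to the sub-question" are all assumptions introduced here for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: (compositional question, its answer that triggers the check,
#                       sub-question, sub-question answer that this implies)
Entailment = Tuple[str, str, str, str]


def consistency_rate(predictions: Dict[str, str],
                     entailments: List[Entailment]) -> float:
    """Fraction of triggered entailments whose sub-question answer agrees.

    An entailment is "triggered" when the model's answer to the compositional
    question equals the triggering answer. Returns 1.0 if none are triggered.
    """
    checked = 0
    consistent = 0
    for comp_q, comp_a, sub_q, sub_a in entailments:
        if predictions.get(comp_q) != comp_a:
            continue  # entailment does not apply to this prediction
        checked += 1
        if predictions.get(sub_q) == sub_a:
            consistent += 1
    return consistent / checked if checked else 1.0


# The example from the abstract: "yes" to the compositional question
# implies "yes" to the sub-question about the bird.
comp = "Are there any elephants to the right of the white bird?"
sub = "Is any bird visible in the scene?"
entailments = [(comp, "yes", sub, "yes")]

inconsistent_model = {comp: "yes", sub: "no"}
consistent_model = {comp: "yes", sub: "yes"}

rate_bad = consistency_rate(inconsistent_model, entailments)
rate_good = consistency_rate(consistent_model, entailments)
```

Under these assumptions, the inconsistent predictions from the abstract's example score 0.0 and the consistent ones score 1.0; a training-time constraint would penalize the former, though the exact form used by the authors is not given in the abstract.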
Cite
Text
Jing et al. "Maintaining Reasoning Consistency in Compositional Visual Question Answering." Conference on Computer Vision and Pattern Recognition, 2022. doi:10.1109/CVPR52688.2022.00504
Markdown
[Jing et al. "Maintaining Reasoning Consistency in Compositional Visual Question Answering." Conference on Computer Vision and Pattern Recognition, 2022.](https://mlanthology.org/cvpr/2022/jing2022cvpr-maintaining/) doi:10.1109/CVPR52688.2022.00504
BibTeX
@inproceedings{jing2022cvpr-maintaining,
title = {{Maintaining Reasoning Consistency in Compositional Visual Question Answering}},
author = {Jing, Chenchen and Jia, Yunde and Wu, Yuwei and Liu, Xinyu and Wu, Qi},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2022},
pages = {5099-5108},
doi = {10.1109/CVPR52688.2022.00504},
url = {https://mlanthology.org/cvpr/2022/jing2022cvpr-maintaining/}
}