VQA-GNN: Reasoning with Multimodal Knowledge via Graph Neural Networks for Visual Question Answering

Abstract

Visual question answering (VQA) requires systems to perform concept-level reasoning by unifying unstructured (e.g., the context in question and answer; "QA context") and structured (e.g., knowledge graph for the QA context and scene; "concept graph") multimodal knowledge. Existing works typically combine a scene graph and a concept graph of the scene by connecting corresponding visual nodes and concept nodes, then incorporate the QA context representation to perform question answering. However, these methods only perform a unidirectional fusion from unstructured knowledge to structured knowledge, limiting their potential to capture joint reasoning over the heterogeneous modalities of knowledge. To perform more expressive reasoning, we propose VQA-GNN, a new VQA model that performs bidirectional fusion between unstructured and structured multimodal knowledge to obtain unified knowledge representations. Specifically, we inter-connect the scene graph and the concept graph through a super node that represents the QA context, and introduce a new multimodal GNN technique to perform inter-modal message passing for reasoning that mitigates representational gaps between modalities. On two challenging VQA tasks (VCR and GQA), our method outperforms strong baseline VQA methods by 3.2% on VCR (Q-AR) and 4.6% on GQA, suggesting its strength in performing concept-level reasoning. Ablation studies further demonstrate the efficacy of the bidirectional fusion and multimodal GNN method in unifying unstructured and structured multimodal knowledge.
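To make the described architecture concrete, below is a minimal sketch (not the authors' implementation, and not the anthology's content) of the core idea from the abstract: scene-graph nodes and concept-graph nodes are joined into one graph through a single "QA context" super node, and a small GNN passes messages over the unified graph so fusion flows in both directions. All class names, dimensions, node counts, and the mean-aggregation/GRU update are illustrative assumptions.

```python
# Hypothetical sketch of bidirectional fusion via a QA-context super node.
# Dimensions, layer design, and the toy graph are assumptions for illustration.

import torch
import torch.nn as nn


class MultimodalMessagePassing(nn.Module):
    """One round of mean-aggregation message passing with modality-specific
    message transforms (visual nodes, concept nodes, QA-context super node)."""

    def __init__(self, dim: int):
        super().__init__()
        # Separate transforms per source modality, a simple stand-in for
        # mitigating the representational gap between modalities.
        self.msg = nn.ModuleDict({
            "visual": nn.Linear(dim, dim),
            "concept": nn.Linear(dim, dim),
            "context": nn.Linear(dim, dim),
        })
        self.update = nn.GRUCell(dim, dim)

    def forward(self, x, edges, node_modality):
        # x: [num_nodes, dim] node features
        # edges: list of (src, dst) pairs over the unified graph
        # node_modality: node index -> "visual" / "concept" / "context"
        agg = torch.zeros_like(x)
        count = torch.zeros(x.size(0), 1)
        for src, dst in edges:
            agg[dst] += self.msg[node_modality[src]](x[src])
            count[dst] += 1
        agg = agg / count.clamp(min=1)
        return self.update(agg, x)


# Toy unified graph: nodes 0-2 form a scene graph, nodes 3-5 a concept graph,
# and node 6 is the QA-context super node linking the two graphs.
dim = 16
x = torch.randn(7, dim)
modality = ["visual"] * 3 + ["concept"] * 3 + ["context"]
edges = [(0, 1), (1, 2),        # scene-graph edges
         (3, 4), (4, 5),        # concept-graph edges
         (0, 6), (3, 6),        # structured -> QA context
         (6, 0), (6, 3)]        # QA context -> structured (bidirectional fusion)

layer = MultimodalMessagePassing(dim)
for _ in range(2):              # two rounds of inter-modal message passing
    x = layer(x, edges, modality)
print(x[6].shape)               # fused QA-context representation: torch.Size([16])
```

In this toy setup the super node both receives messages from and sends messages to the two graphs, which is the bidirectional fusion the paper contrasts with prior unidirectional approaches.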

Cite

Text

Wang et al. "VQA-GNN: Reasoning with Multimodal Knowledge via Graph Neural Networks for Visual Question Answering." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.01973

Markdown

[Wang et al. "VQA-GNN: Reasoning with Multimodal Knowledge via Graph Neural Networks for Visual Question Answering." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/wang2023iccv-vqagnn/) doi:10.1109/ICCV51070.2023.01973

BibTeX

@inproceedings{wang2023iccv-vqagnn,
  title     = {{VQA-GNN: Reasoning with Multimodal Knowledge via Graph Neural Networks for Visual Question Answering}},
  author    = {Wang, Yanan and Yasunaga, Michihiro and Ren, Hongyu and Wada, Shinya and Leskovec, Jure},
  booktitle = {International Conference on Computer Vision},
  year      = {2023},
  pages     = {21582--21592},
  doi       = {10.1109/ICCV51070.2023.01973},
  url       = {https://mlanthology.org/iccv/2023/wang2023iccv-vqagnn/}
}