Relation-Aware Graph Attention Network for Visual Question Answering

Abstract

To answer semantically complicated questions about an image, a Visual Question Answering (VQA) model needs to fully understand the visual scene in the image, especially the interactive dynamics between different objects. We propose a Relation-aware Graph Attention Network (ReGAT), which encodes each image into a graph and models multi-type inter-object relations via a graph attention mechanism to learn question-adaptive relation representations. Two types of visual object relations are explored: (i) Explicit Relations that represent geometric positions and semantic interactions between objects; and (ii) Implicit Relations that capture the hidden dynamics between image regions. Experiments demonstrate that ReGAT outperforms prior state-of-the-art approaches on both the VQA 2.0 and VQA-CP v2 datasets. We further show that ReGAT is compatible with existing VQA architectures and can be used as a generic relation encoder to boost VQA model performance.
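As a rough illustration of what question-adaptive attention over an object graph could look like, here is a minimal sketch in plain Python. It is not the ReGAT implementation: the function name `relation_attention`, the concatenation of the question embedding onto each region feature, and the scaled dot-product scoring are all assumptions made for this example, and the learned projections and relation-type biases a real model would need are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def relation_attention(objects, question, adjacency):
    """One round of question-conditioned attention over an object graph.

    objects:   list of N region feature vectors (lists of floats)
    question:  question embedding (list of floats); hypothetical input
    adjacency: N x N 0/1 matrix; adjacency[i][j] == 1 means j is a neighbour of i
    """
    # Assumption: make each node question-adaptive by concatenating
    # the question embedding onto the region feature.
    nodes = [obj + question for obj in objects]
    dim = len(nodes[0])
    updated = []
    for i, node_i in enumerate(nodes):
        # Attend over graph neighbours plus the node itself (self-loop).
        neighbours = [j for j in range(len(nodes)) if adjacency[i][j] or j == i]
        # Assumption: scaled dot-product scores stand in for a learned scorer.
        scores = [dot(node_i, nodes[j]) / math.sqrt(dim) for j in neighbours]
        weights = softmax(scores)
        # New representation: attention-weighted sum of neighbour features.
        updated.append([
            sum(w * nodes[j][k] for w, j in zip(weights, neighbours))
            for k in range(dim)
        ])
    return updated
```

With three toy regions, a one-dimensional question embedding, and a chain-shaped adjacency matrix, `relation_attention(objects, question, adjacency)` returns one updated vector per region, each the same length as a region feature plus the question embedding. Restricting attention to the adjacency matrix is one way an explicit relation graph could prune which object pairs interact; a fully connected matrix would correspond more closely to the implicit setting.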

Cite

Text

Li et al. "Relation-Aware Graph Attention Network for Visual Question Answering." Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019. doi:10.1109/ICCV.2019.01041

Markdown

[Li et al. "Relation-Aware Graph Attention Network for Visual Question Answering." Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019.](https://mlanthology.org/iccv/2019/li2019iccv-relationaware/) doi:10.1109/ICCV.2019.01041

BibTeX

@inproceedings{li2019iccv-relationaware,
  title     = {{Relation-Aware Graph Attention Network for Visual Question Answering}},
  author    = {Li, Linjie and Gan, Zhe and Cheng, Yu and Liu, Jingjing},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year      = {2019},
  doi       = {10.1109/ICCV.2019.01041},
  url       = {https://mlanthology.org/iccv/2019/li2019iccv-relationaware/}
}