Densely Connected Attention Flow for Visual Question Answering

Abstract

Learning effective interactions between multi-modal features is at the heart of visual question answering (VQA). A common shortcoming of existing VQA approaches is that they consider only a limited number of interactions, which may not be enough to model the latent, complex image-question relations necessary for accurately answering questions. In this paper, we therefore propose DCAF (Densely Connected Attention Flow), a novel framework for modeling dense interactions. It densely connects all pairwise layers of the network via Attention Connectors, capturing fine-grained interplay between image and question across all hierarchical levels. The proposed Attention Connector efficiently connects the multi-modal features at any two layers with symmetric co-attention and produces interaction-aware attention features. Experimental results on three publicly available datasets show that the proposed method achieves state-of-the-art performance.
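To make the mechanism concrete, below is a minimal PyTorch sketch of a symmetric co-attention connector in the spirit of the abstract's Attention Connector: it takes image features and question features from any two layers, computes a shared affinity matrix, and attends in both directions to produce interaction-aware features. The class name `AttentionConnector`, the bilinear affinity parameterization, and the tensor shapes are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionConnector(nn.Module):
    """Hypothetical sketch of a symmetric co-attention connector.

    Given image features V (B x Nv x D) and question features Q (B x Nq x D)
    drawn from any two layers, it computes a shared affinity matrix and uses
    it to attend in both directions, returning interaction-aware features.
    """

    def __init__(self, dim: int):
        super().__init__()
        # Bilinear weight for the affinity matrix (an assumption; the paper's
        # exact parameterization may differ).
        self.W = nn.Parameter(torch.randn(dim, dim) / dim ** 0.5)

    def forward(self, V: torch.Tensor, Q: torch.Tensor):
        # Affinity between every image region i and question word j:
        # A[b, i, j] = V[b, i] . W . Q[b, j]
        A = torch.einsum('bid,de,bje->bij', V, self.W, Q)

        # Symmetric co-attention: normalize over question words for each
        # image region, and over image regions for each question word.
        attn_v = F.softmax(A, dim=2)                   # B x Nv x Nq
        attn_q = F.softmax(A, dim=1).transpose(1, 2)   # B x Nq x Nv

        V_att = torch.bmm(attn_v, Q)  # question-aware image features
        Q_att = torch.bmm(attn_q, V)  # image-aware question features
        return V_att, Q_att
```

In a densely connected setup, a connector of this kind would be applied to every pair of image/question layers and the resulting attention features fused with the subsequent layers' inputs; how that fusion is done here is only suggestive of the paper's actual design.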

Cite

Text

Liu et al. "Densely Connected Attention Flow for Visual Question Answering." International Joint Conference on Artificial Intelligence, 2019. doi:10.24963/IJCAI.2019/122

Markdown

[Liu et al. "Densely Connected Attention Flow for Visual Question Answering." International Joint Conference on Artificial Intelligence, 2019.](https://mlanthology.org/ijcai/2019/liu2019ijcai-densely/) doi:10.24963/IJCAI.2019/122

BibTeX

@inproceedings{liu2019ijcai-densely,
  title     = {{Densely Connected Attention Flow for Visual Question Answering}},
  author    = {Liu, Fei and Liu, Jing and Fang, Zhiwei and Hong, Richang and Lu, Hanqing},
  booktitle = {International Joint Conference on Artificial Intelligence},
  year      = {2019},
  pages     = {869--875},
  doi       = {10.24963/IJCAI.2019/122},
  url       = {https://mlanthology.org/ijcai/2019/liu2019ijcai-densely/}
}