Deep Modular Co-Attention Networks for Visual Question Answering

Yu, Zhou; Yu, Jun; Cui, Yuhao; Tao, Dacheng; Tian, Qi

doi:10.1109/CVPR.2019.00644

CVPR 2019


Abstract

Visual Question Answering (VQA) requires a fine-grained and simultaneous understanding of both the visual content of images and the textual content of questions. Therefore, designing an effective 'co-attention' model to associate key words in questions with key objects in images is central to VQA performance. So far, most successful attempts at co-attention learning have used shallow models, and deep co-attention models show little improvement over their shallow counterparts. In this paper, we propose a deep Modular Co-Attention Network (MCAN) that consists of Modular Co-Attention (MCA) layers cascaded in depth. Each MCA layer jointly models the self-attention of questions and images, as well as the question-guided attention of images, using a modular composition of two basic attention units. We quantitatively and qualitatively evaluate MCAN on the benchmark VQA-v2 dataset and conduct extensive ablation studies to explore the reasons behind MCAN's effectiveness. Experimental results demonstrate that MCAN significantly outperforms the previous state-of-the-art. Our best single model delivers 70.63% overall accuracy on the test-dev set.
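The modular composition described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`attend`, `self_attention`, `guided_attention`, `mca_layer`, `mcan_sketch`) are hypothetical, the wiring order inside the layer is an assumption based only on the abstract, and learned weights, which attention units normally include, are omitted so that the example runs with the Python standard library alone.

```python
import math

def softmax(row):
    # Numerically stable softmax over one list of scores.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    # Scaled dot-product attention: each query row becomes a
    # weighted average of the value rows.
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def add_residual(x, update):
    # Element-wise sum of the input rows and the attended rows.
    return [[a + b for a, b in zip(r, u)] for r, u in zip(x, update)]

def self_attention(x):
    # Unit 1 (assumed form): features attend to themselves.
    return add_residual(x, attend(x, x, x))

def guided_attention(x, y):
    # Unit 2 (assumed form): x supplies queries, y supplies keys and values.
    return add_residual(x, attend(x, y, y))

def mca_layer(question, image):
    # Assumed wiring: self-attention on each modality, then
    # question-guided attention on the image features.
    question = self_attention(question)
    image = self_attention(image)
    image = guided_attention(image, question)
    return question, image

def mcan_sketch(question, image, depth):
    # Cascade `depth` layers; depth is a free parameter of this sketch.
    for _ in range(depth):
        question, image = mca_layer(question, image)
    return question, image

# Toy inputs: 2 question words and 4 image regions, 3 features each.
question = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4]]
image = [[0.5, 0.1, 0.0], [0.2, 0.2, 0.2], [-0.3, 0.4, 0.1], [0.0, 0.0, 0.6]]
q_out, img_out = mcan_sketch(question, image, depth=2)
print(len(q_out), len(img_out), len(img_out[0]))
```

Each unit preserves the shape of its first argument, which is what allows the layers to be stacked to an arbitrary depth in this sketch.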


Cite

Text

Yu et al. "Deep Modular Co-Attention Networks for Visual Question Answering." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019. doi:10.1109/CVPR.2019.00644

Markdown

[Yu et al. "Deep Modular Co-Attention Networks for Visual Question Answering." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.](https://mlanthology.org/cvpr/2019/yu2019cvpr-deep/) doi:10.1109/CVPR.2019.00644

BibTeX

@inproceedings{yu2019cvpr-deep,
  title     = {{Deep Modular Co-Attention Networks for Visual Question Answering}},
  author    = {Yu, Zhou and Yu, Jun and Cui, Yuhao and Tao, Dacheng and Tian, Qi},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2019},
  doi       = {10.1109/CVPR.2019.00644},
  url       = {https://mlanthology.org/cvpr/2019/yu2019cvpr-deep/}
}