Multi-Level Attention Networks for Visual Question Answering

Dongfei Yu, Jianlong Fu, Tao Mei, Yong Rui

CVPR 2017

doi:10.1109/CVPR.2017.446

Abstract

Inspired by the recent success of text-based question answering, visual question answering (VQA) has been proposed to automatically answer natural language questions with reference to a given image. Compared with text-based QA, VQA is more challenging because reasoning in the visual domain needs both effective semantic embedding and fine-grained visual understanding. Existing approaches predominantly infer answers from abstract low-level visual features, while neglecting the modeling of high-level image semantics and the rich spatial context of regions. To address these challenges, we propose a multi-level attention network for visual question answering that simultaneously reduces the semantic gap through semantic attention and supports fine-grained spatial inference through visual attention. First, we generate semantic concepts from the high-level semantics in a convolutional neural network (CNN) and select the question-related concepts as semantic attention. Second, we encode region-based middle-level CNN outputs into a spatially embedded representation with a bidirectional recurrent neural network, and further pinpoint the answer-related regions with a multilayer perceptron as visual attention. Third, we jointly optimize semantic attention, visual attention, and question embedding with a softmax classifier to infer the final answer. Extensive experiments show that the proposed approach outperforms the state of the art on two challenging VQA datasets.
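
The three steps in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`attend`, `answer_distribution`, `random_params`), the shared feature dimension, the random weights, and the fusion by concatenation are all assumptions made for illustration. The sketch also omits the CNN and the bidirectional recurrent network and assumes the concept and region features are already computed.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(items, question, W_i, W_q, w):
    """Question-guided attention with a one-hidden-layer MLP (assumed form).

    items:    (n, d) concept or region features
    question: (d,)   question embedding
    Returns the attention-weighted sum of items and the weights.
    """
    hidden = np.tanh(items @ W_i + question @ W_q)  # (n, h), question broadcast over items
    weights = softmax(hidden @ w)                   # (n,), sums to 1
    return weights @ items, weights                 # (d,), (n,)


def answer_distribution(concepts, regions, question, params):
    """Fuse semantic attention, visual attention, and the question embedding.

    Fusion by concatenation is an assumption of this sketch.
    """
    sem, sem_w = attend(concepts, question, *params["semantic"])
    vis, vis_w = attend(regions, question, *params["visual"])
    joint = np.concatenate([sem, vis, question])    # (3d,)
    return softmax(joint @ params["W_out"]), sem_w, vis_w


def random_params(rng, d, h, n_answers):
    """Hypothetical random weights standing in for trained parameters."""
    def attention_params():
        return (
            0.1 * rng.normal(size=(d, h)),  # W_i
            0.1 * rng.normal(size=(d, h)),  # W_q
            0.1 * rng.normal(size=h),       # w
        )

    return {
        "semantic": attention_params(),
        "visual": attention_params(),
        "W_out": 0.1 * rng.normal(size=(3 * d, n_answers)),
    }


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, h, n_answers = 8, 16, 5
    params = random_params(rng, d, h, n_answers)
    concepts = rng.normal(size=(10, d))  # 10 semantic concept embeddings (assumed)
    regions = rng.normal(size=(14, d))   # 14 region features (assumed)
    question = rng.normal(size=d)
    probs, sem_w, vis_w = answer_distribution(concepts, regions, question, params)
    print(probs.shape, sem_w.shape, vis_w.shape)
```

In a trained model, the attention weights and the output projection would be learned jointly from the answer classification loss, which corresponds to the joint optimization described in the third step.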

Cite

Text

Yu et al. "Multi-Level Attention Networks for Visual Question Answering." Conference on Computer Vision and Pattern Recognition, 2017. doi:10.1109/CVPR.2017.446

Markdown

[Yu et al. "Multi-Level Attention Networks for Visual Question Answering." Conference on Computer Vision and Pattern Recognition, 2017.](https://mlanthology.org/cvpr/2017/yu2017cvpr-multilevel/) doi:10.1109/CVPR.2017.446

BibTeX

@inproceedings{yu2017cvpr-multilevel,
  title     = {{Multi-Level Attention Networks for Visual Question Answering}},
  author    = {Yu, Dongfei and Fu, Jianlong and Mei, Tao and Rui, Yong},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2017},
  doi       = {10.1109/CVPR.2017.446},
  url       = {https://mlanthology.org/cvpr/2017/yu2017cvpr-multilevel/}
}