Visual Grounding via Accumulated Attention

Deng, Chaorui; Wu, Qi; Wu, Qingyao; Hu, Fuyuan; Lyu, Fan; Tan, Mingkui

doi:10.1109/CVPR.2018.00808

CVPR 2018

Abstract

Visual Grounding (VG) aims to locate the most relevant object or region in an image based on a natural language query. The query can be a phrase, a sentence, or even a multi-round dialogue. There are three main challenges in VG: 1) identifying the main focus of the query; 2) understanding the image; and 3) locating the object. Most existing methods simply combine all the information, and so may suffer from information redundancy (i.e., an ambiguous query, a complicated image, and a large number of objects). In this paper, we formulate these challenges as three attention problems and propose an accumulated attention (A-ATT) mechanism to reason among them jointly. Our A-ATT mechanism circularly accumulates attention on useful information in the image, query, and objects, while noise is gradually ignored. We evaluate A-ATT on four popular datasets (namely ReferCOCO, ReferCOCO+, ReferCOCOg, and Guesswhat?!), and the experimental results show the superiority of the proposed method in terms of accuracy.
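
The circular accumulation described in the abstract can be illustrated with a toy sketch. This is not the A-ATT architecture from the paper: the abstract does not specify the attention functions, so the sketch assumes plain dot-product attention over small hand-written feature vectors, and every name in it (`attend`, `accumulated_attention`, `rounds`) is hypothetical. It only shows the general idea of three attention steps that repeatedly guide one another.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(features, guide):
    # Assumed dot-product attention: score each feature vector against
    # the guide, then return the weights and the weighted-sum summary.
    scores = [sum(f * g for f, g in zip(feat, guide)) for feat in features]
    weights = softmax(scores)
    dim = len(features[0])
    summary = [sum(w * feat[d] for w, feat in zip(weights, features))
               for d in range(dim)]
    return weights, summary

def mean(features):
    # Unweighted average, used as the initial summary of each input.
    dim = len(features[0])
    return [sum(feat[d] for feat in features) / len(features)
            for d in range(dim)]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def accumulated_attention(query, image, objects, rounds=3):
    # Start from uniform summaries, then cycle query -> image -> objects,
    # each step guided by the current summaries of the other two.
    q_sum, i_sum, o_sum = mean(query), mean(image), mean(objects)
    o_weights = [1.0 / len(objects)] * len(objects)
    for _ in range(rounds):
        _, q_sum = attend(query, add(i_sum, o_sum))
        _, i_sum = attend(image, add(q_sum, o_sum))
        o_weights, o_sum = attend(objects, add(q_sum, i_sum))
    return o_weights

# Toy 2-D features (illustrative values only).
query = [[1.0, 0.0], [0.1, 0.1]]    # word features
image = [[1.0, 0.0], [0.0, 0.2]]    # image region features
objects = [[1.0, 0.0], [0.0, 1.0]]  # candidate object features
weights = accumulated_attention(query, image, objects)
best = max(range(len(weights)), key=lambda k: weights[k])
```

In this toy setup the final attention weights over the candidate objects play the role of the grounding decision: the candidate with the largest weight is taken as the referred object.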

Cite

Text

Deng et al. "Visual Grounding via Accumulated Attention." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018. doi:10.1109/CVPR.2018.00808

Markdown

[Deng et al. "Visual Grounding via Accumulated Attention." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.](https://mlanthology.org/cvpr/2018/deng2018cvpr-visual/) doi:10.1109/CVPR.2018.00808

BibTeX

@inproceedings{deng2018cvpr-visual,
  title     = {{Visual Grounding via Accumulated Attention}},
  author    = {Deng, Chaorui and Wu, Qi and Wu, Qingyao and Hu, Fuyuan and Lyu, Fan and Tan, Mingkui},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2018},
  doi       = {10.1109/CVPR.2018.00808},
  url       = {https://mlanthology.org/cvpr/2018/deng2018cvpr-visual/}
}