AttnGrounder: Talking to Cars with Attention
Abstract
We propose Attention Grounder (AttnGrounder), a single-stage, end-to-end trainable model for the task of visual grounding. Visual grounding aims to localize a specific object in an image based on a given natural language text query. Unlike previous methods that use the same text representation for every image region, we use a visual-text attention module that relates each word in the given query with every region in the corresponding image to construct a region-dependent text representation. Furthermore, to improve the localization ability of our model, we use our visual-text attention module to generate an attention mask around the referred object. The attention mask is trained as an auxiliary task using a rectangular mask generated from the provided ground-truth coordinates. We evaluate AttnGrounder on the Talk2Car dataset and show an improvement of 3.26% over existing methods.
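The two ideas from the abstract can be illustrated with a minimal sketch: attending from image regions to query words to get a region-dependent text representation, and building the rectangular ground-truth mask used for the auxiliary task. This is an assumption-laden illustration (dot-product attention, NumPy, invented function names), not the paper's actual implementation.

```python
import numpy as np

def visual_text_attention(regions, words):
    """Relate each word to each image region (illustrative sketch).

    regions: (R, d) array of region features.
    words:   (T, d) array of word embeddings.
    Returns a region-dependent text representation (R, d) and the
    attention weights (R, T). Dot-product attention is an assumption;
    the paper may use a different scoring function.
    """
    scores = regions @ words.T                      # (R, T) similarity
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn = e / e.sum(axis=1, keepdims=True)         # softmax over words
    return attn @ words, attn                       # per-region text rep

def rectangular_mask(height, width, box):
    """Binary mask from ground-truth box (x1, y1, x2, y2), used as the
    auxiliary training target for the predicted attention mask."""
    mask = np.zeros((height, width), dtype=np.float32)
    x1, y1, x2, y2 = box
    mask[y1:y2, x1:x2] = 1.0
    return mask
```

In this sketch, each row of `attn` tells us which query words matter for a given region, so every region receives its own weighted summary of the text rather than a single shared query vector.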
Cite
Text
Mittal. "AttnGrounder: Talking to Cars with Attention." European Conference on Computer Vision Workshops, 2020. doi:10.1007/978-3-030-66096-3_6
Markdown
[Mittal. "AttnGrounder: Talking to Cars with Attention." European Conference on Computer Vision Workshops, 2020.](https://mlanthology.org/eccvw/2020/mittal2020eccvw-attngrounder/) doi:10.1007/978-3-030-66096-3_6
BibTeX
@inproceedings{mittal2020eccvw-attngrounder,
title = {{AttnGrounder: Talking to Cars with Attention}},
author = {Mittal, Vivek},
booktitle = {European Conference on Computer Vision Workshops},
year = {2020},
pages = {62--73},
doi = {10.1007/978-3-030-66096-3_6},
url = {https://mlanthology.org/eccvw/2020/mittal2020eccvw-attngrounder/}
}