Where to Look: Focus Regions for Visual Question Answering

Abstract

We present a method that learns to answer visual questions by selecting image regions relevant to the text-based query. The method maps textual queries and visual features from various regions into a shared space, where they are compared for relevance with an inner product. It yields significant improvements on questions such as "what color," where a specific location must be evaluated, and "what room," where it must selectively identify informative image regions. We evaluate the model on the recently released VQA dataset, which features free-form, human-annotated questions and answers.
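
The abstract describes scoring image regions against the text query via an inner product in a shared embedding space. Below is a minimal, hypothetical sketch of that idea (not the authors' code): text and per-region visual features are projected into a shared space, regions are scored by inner product, and the scores are softmax-normalized to pool region features. All dimensions, projection matrices, and the softmax pooling are illustrative assumptions, not values or details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

d_text, d_img, d_shared = 300, 4096, 512   # assumed feature dimensions
n_regions = 100                            # assumed number of region proposals per image

# Learned projections into the shared space (randomly initialized here for illustration).
W_text = rng.standard_normal((d_shared, d_text)) * 0.01
W_img  = rng.standard_normal((d_shared, d_img)) * 0.01

def region_attention(text_feat, region_feats):
    """Score each region against the text query with an inner product in the
    shared space; return softmax weights and the attention-pooled region feature."""
    q = W_text @ text_feat              # (d_shared,) projected text query
    v = region_feats @ W_img.T          # (n_regions, d_shared) projected regions
    scores = v @ q                      # (n_regions,) inner-product relevance scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()            # softmax over regions
    attended = weights @ v              # relevance-weighted combination of regions
    return weights, attended

# Dummy inputs standing in for a question embedding and CNN region features.
text_feat = rng.standard_normal(d_text)
region_feats = rng.standard_normal((n_regions, d_img))
weights, attended = region_attention(text_feat, region_feats)
print(weights.argmax(), attended.shape)  # index of most relevant region, pooled feature shape
```

In a full model, the pooled feature would feed an answer classifier and the projections would be trained end-to-end; this sketch only shows the relevance-scoring step the abstract refers to.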

Cite

Text

Shih et al. "Where to Look: Focus Regions for Visual Question Answering." Conference on Computer Vision and Pattern Recognition, 2016. doi:10.1109/CVPR.2016.499

Markdown

[Shih et al. "Where to Look: Focus Regions for Visual Question Answering." Conference on Computer Vision and Pattern Recognition, 2016.](https://mlanthology.org/cvpr/2016/shih2016cvpr-look/) doi:10.1109/CVPR.2016.499

BibTeX

@inproceedings{shih2016cvpr-look,
  title     = {{Where to Look: Focus Regions for Visual Question Answering}},
  author    = {Shih, Kevin J. and Singh, Saurabh and Hoiem, Derek},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2016},
  doi       = {10.1109/CVPR.2016.499},
  url       = {https://mlanthology.org/cvpr/2016/shih2016cvpr-look/}
}