Where to Look: Focus Regions for Visual Question Answering
Abstract
We present a method that learns to answer visual questions by selecting image regions relevant to the text-based query. Our method maps textual queries and visual features from various regions into a shared space where they are compared for relevance with an inner product. Our method exhibits significant improvements in answering questions such as "what color," where it is necessary to evaluate a specific location, and "what room," where it selectively identifies informative image regions. Our model is tested on the recently released VQA dataset, which features free-form human-annotated questions and answers.
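As a concrete illustration of the relevance mechanism the abstract describes, the sketch below shows one plausible way to project a query embedding and per-region visual features into a shared space, score each region with an inner product, and softmax the scores into attention weights. This is not the authors' released implementation; the dimensions, projection matrices, and variable names are illustrative assumptions.

# Minimal sketch (assumed names and dimensions, not the paper's code):
# score regions against a text query via a shared embedding + inner product.
import numpy as np

rng = np.random.default_rng(0)
num_regions, vis_dim, txt_dim, embed_dim = 100, 4096, 300, 512

# Hypothetical learned projections into the shared space.
W_text = rng.standard_normal((txt_dim, embed_dim)) * 0.01
W_region = rng.standard_normal((vis_dim, embed_dim)) * 0.01

query = rng.standard_normal(txt_dim)                    # pooled question embedding
regions = rng.standard_normal((num_regions, vis_dim))   # CNN features per region

q = query @ W_text          # (embed_dim,)
r = regions @ W_region      # (num_regions, embed_dim)

scores = r @ q              # inner-product relevance, one score per region
weights = np.exp(scores - scores.max())
weights /= weights.sum()    # softmax attention over regions

# Region features pooled by relevance, e.g. as input to an answer classifier.
attended = weights @ regions   # (vis_dim,)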
Cite
Text
Shih et al. "Where to Look: Focus Regions for Visual Question Answering." Conference on Computer Vision and Pattern Recognition, 2016. doi:10.1109/CVPR.2016.499
Markdown
[Shih et al. "Where to Look: Focus Regions for Visual Question Answering." Conference on Computer Vision and Pattern Recognition, 2016.](https://mlanthology.org/cvpr/2016/shih2016cvpr-look/) doi:10.1109/CVPR.2016.499
BibTeX
@inproceedings{shih2016cvpr-look,
title = {{Where to Look: Focus Regions for Visual Question Answering}},
author = {Shih, Kevin J. and Singh, Saurabh and Hoiem, Derek},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2016},
doi = {10.1109/CVPR.2016.499},
url = {https://mlanthology.org/cvpr/2016/shih2016cvpr-look/}
}