Discriminative Bimodal Networks for Visual Localization and Detection with Natural Language Queries

Abstract

Associating image regions with text queries has been recently explored as a new way to bridge visual and linguistic representations. A few pioneering approaches have been proposed based on recurrent neural language models trained generatively (e.g., generating captions), but achieving somewhat limited localization accuracy. To better address natural-language-based visual entity localization, we propose a discriminative approach. We formulate a discriminative bimodal neural network (DBNet), which can be trained by a classifier with extensive use of negative samples. Our training objective encourages better localization on single images, incorporates text phrases in a broad range, and properly pairs image regions with text phrases into positive and negative examples. Experiments on the Visual Genome dataset demonstrate the proposed DBNet significantly outperforms previous state-of-the-art methods both for localization on single images and for detection on multiple images. We also establish an evaluation protocol for natural-language visual detection.

Cite

Text

Zhang et al. "Discriminative Bimodal Networks for Visual Localization and Detection with Natural Language Queries." Conference on Computer Vision and Pattern Recognition, 2017. doi:10.1109/CVPR.2017.122

Markdown

[Zhang et al. "Discriminative Bimodal Networks for Visual Localization and Detection with Natural Language Queries." Conference on Computer Vision and Pattern Recognition, 2017.](https://mlanthology.org/cvpr/2017/zhang2017cvpr-discriminative/) doi:10.1109/CVPR.2017.122

BibTeX

@inproceedings{zhang2017cvpr-discriminative,
  title     = {{Discriminative Bimodal Networks for Visual Localization and Detection with Natural Language Queries}},
  author    = {Zhang, Yuting and Yuan, Luyao and Guo, Yijie and He, Zhiyuan and Huang, I-An and Lee, Honglak},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2017},
  doi       = {10.1109/CVPR.2017.122},
  url       = {https://mlanthology.org/cvpr/2017/zhang2017cvpr-discriminative/}
}