Learning Visual N-Grams from Web Data

Abstract

Real-world image recognition systems need to recognize tens of thousands of classes that constitute a plethora of visual concepts. The traditional approach of annotating thousands of images per class for training is infeasible in such a scenario, prompting the use of webly supervised data. This paper explores the training of image-recognition systems on large numbers of images and associated user comments. In particular, we develop visual n-gram models that can predict arbitrary phrases that are relevant to the content of an image. Our visual n-gram models are feed-forward convolutional networks trained using new loss functions that are inspired by n-gram models commonly used in language modeling. We demonstrate the merits of our models in phrase prediction, phrase-based image retrieval, relating images and captions, and zero-shot transfer.
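As a rough illustration of the idea summarized in the abstract, the sketch below shows one minimal way a visual n-gram objective could be set up: image features from a convolutional backbone are scored against a dictionary of n-gram embeddings, and the negative log-likelihood of the n-grams observed in each image's comment is minimized. This is an assumption-laden example written for this page, not the paper's actual loss (the paper uses smoothed n-gram losses inspired by language modeling); names such as VisualNGramHead, ngram_vocab_size, and embed_dim are invented for the illustration.

# Illustrative sketch only, not the paper's formulation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualNGramHead(nn.Module):
    def __init__(self, embed_dim, ngram_vocab_size):
        super().__init__()
        # One learned embedding per n-gram in a fixed dictionary of frequent n-grams.
        self.ngram_embeddings = nn.Embedding(ngram_vocab_size, embed_dim)

    def forward(self, image_features):
        # image_features: (batch, embed_dim) from a convolutional backbone.
        # Returns unnormalized scores over the n-gram dictionary: (batch, ngram_vocab_size).
        return image_features @ self.ngram_embeddings.weight.t()

def ngram_nll_loss(scores, observed_ngrams):
    # observed_ngrams: list (one entry per image) of LongTensors holding the
    # dictionary indices of n-grams that appear in that image's comment.
    log_probs = F.log_softmax(scores, dim=1)
    losses = [-log_probs[i, idx].mean() for i, idx in enumerate(observed_ngrams)]
    return torch.stack(losses).mean()

# Example usage with random features, assuming a 512-d embedding and a 10k n-gram dictionary:
#   head = VisualNGramHead(512, 10_000)
#   scores = head(torch.randn(2, 512))
#   loss = ngram_nll_loss(scores, [torch.tensor([3, 17]), torch.tensor([42])])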

Cite

Text

Li et al. "Learning Visual N-Grams from Web Data." International Conference on Computer Vision, 2017. doi:10.1109/ICCV.2017.449

Markdown

[Li et al. "Learning Visual N-Grams from Web Data." International Conference on Computer Vision, 2017.](https://mlanthology.org/iccv/2017/li2017iccv-learning-b/) doi:10.1109/ICCV.2017.449

BibTeX

@inproceedings{li2017iccv-learning-b,
  title     = {{Learning Visual N-Grams from Web Data}},
  author    = {Li, Ang and Jabri, Allan and Joulin, Armand and van der Maaten, Laurens},
  booktitle = {International Conference on Computer Vision},
  year      = {2017},
  doi       = {10.1109/ICCV.2017.449},
  url       = {https://mlanthology.org/iccv/2017/li2017iccv-learning-b/}
}