Finding Beans in Burgers: Deep Semantic-Visual Embedding with Localization

Abstract

Several works have proposed to learn a two-path neural network that maps images and texts, respectively, to a shared Euclidean space where geometry captures useful semantic relationships. Such a multi-modal embedding can be trained and used for various tasks, notably image captioning. In the present work, we introduce a new architecture of this type, with a visual path that leverages recent space-aware pooling mechanisms. Combined with a textual path that is jointly trained from scratch, our semantic-visual embedding offers a versatile model. Once trained under the supervision of captioned images, it yields new state-of-the-art performance on cross-modal retrieval. It also allows the localization of new concepts from the embedding space into any input image, delivering state-of-the-art results on the visual grounding of phrases.
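
As a rough illustration of what cross-modal retrieval in a shared embedding space involves, the sketch below ranks images against a caption by vector similarity. It is not the authors' implementation: the 3-dimensional vectors are made-up placeholders for the outputs of the visual and textual paths, the names `l2_normalize`, `cosine`, and `retrieve` are hypothetical, and the use of cosine similarity as the scoring function is an assumption that the abstract does not state.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine(u, v):
    """Cosine similarity between two vectors of equal length (assumed metric)."""
    u, v = l2_normalize(u), l2_normalize(v)
    return sum(a * b for a, b in zip(u, v))


def retrieve(query, candidates):
    """Return candidate keys sorted from most to least similar to the query."""
    return sorted(candidates, key=lambda k: cosine(query, candidates[k]), reverse=True)


# Hypothetical embeddings; a trained model would produce these from its two paths.
image_embeddings = {
    "burger.jpg": [0.9, 0.1, 0.0],
    "beach.jpg": [0.0, 0.2, 0.9],
    "salad.jpg": [0.5, 0.8, 0.1],
}
caption_embedding = [0.8, 0.2, 0.1]  # stands in for an embedded caption

ranking = retrieve(caption_embedding, image_embeddings)
print(ranking)
```

Swapping the roles, with an image embedding as the query and a dictionary of caption embeddings as candidates, gives the image-to-text direction of the same retrieval task.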

Cite

Text

Engilberge et al. "Finding Beans in Burgers: Deep Semantic-Visual Embedding with Localization." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018. doi:10.1109/CVPR.2018.00419

Markdown

[Engilberge et al. "Finding Beans in Burgers: Deep Semantic-Visual Embedding with Localization." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.](https://mlanthology.org/cvpr/2018/engilberge2018cvpr-finding/) doi:10.1109/CVPR.2018.00419

BibTeX

@inproceedings{engilberge2018cvpr-finding,
  title     = {{Finding Beans in Burgers: Deep Semantic-Visual Embedding with Localization}},
  author    = {Engilberge, Martin and Chevallier, Louis and Pérez, Patrick and Cord, Matthieu},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2018},
  doi       = {10.1109/CVPR.2018.00419},
  url       = {https://mlanthology.org/cvpr/2018/engilberge2018cvpr-finding/}
}