StacMR: Scene-Text Aware Cross-Modal Retrieval

Abstract

Recent models for cross-modal retrieval have benefited from an increasingly rich understanding of visual scenes, afforded by tools such as scene graphs and object interactions. This has resulted in improved matching between the visual representation of an image and the textual representation of its caption. Yet, current visual representations overlook a key aspect: the text appearing in images, which may contain crucial information for retrieval. In this paper, we first propose a new dataset that allows exploration of cross-modal retrieval where images contain scene-text instances. Then, armed with this dataset, we describe several approaches which leverage scene text, including an improved scene-text aware cross-modal retrieval method that uses specialized representations for text from the captions and text from the visual scene, and reconciles them in a common embedding space. Extensive experiments confirm that cross-modal retrieval approaches benefit from scene text and highlight interesting research questions worth exploring further. Dataset and code are available at europe.naverlabs.com/stacmr.
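The core idea in the abstract, specialized encoders for caption text and for scene text whose outputs are reconciled with visual features in one common embedding space, can be sketched as follows. This is a minimal illustration assuming a standard joint-embedding recipe; the module names, dimensions, and the fusion-by-summation choice are assumptions for exposition, not the authors' actual architecture.

    # Minimal sketch of a scene-text aware joint embedding, assuming a
    # standard recipe: encode caption tokens and scene-text (OCR) tokens
    # with separate encoders, project visual features, and place image
    # and caption embeddings in one common space. All names, dimensions,
    # and the sum-based fusion are illustrative assumptions, not the
    # paper's architecture.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SceneTextAwareEmbedder(nn.Module):
        def __init__(self, vocab_size=10000, word_dim=300,
                     embed_dim=1024, visual_dim=2048):
            super().__init__()
            self.word_emb = nn.Embedding(vocab_size, word_dim)
            # Caption branch: GRU over caption word tokens.
            self.caption_rnn = nn.GRU(word_dim, embed_dim, batch_first=True)
            # Scene-text branch: a separate GRU over OCR tokens.
            self.scene_text_rnn = nn.GRU(word_dim, embed_dim, batch_first=True)
            # Visual branch: project pre-extracted CNN features.
            self.visual_proj = nn.Linear(visual_dim, embed_dim)

        def embed_caption(self, caption_ids):
            _, h = self.caption_rnn(self.word_emb(caption_ids))
            return F.normalize(h[-1], dim=-1)

        def embed_image(self, visual_feats, ocr_ids):
            v = self.visual_proj(visual_feats)
            _, h = self.scene_text_rnn(self.word_emb(ocr_ids))
            # Fuse visual and scene-text cues (simple sum as a placeholder).
            return F.normalize(v + h[-1], dim=-1)

    model = SceneTextAwareEmbedder()
    captions = torch.randint(0, 10000, (4, 12))  # dummy caption tokens
    ocr = torch.randint(0, 10000, (4, 6))        # dummy OCR tokens
    feats = torch.randn(4, 2048)                 # dummy CNN features
    sim = model.embed_image(feats, ocr) @ model.embed_caption(captions).T
    print(sim.shape)  # (4, 4) image-to-caption similarity matrix

Because both embeddings are L2-normalized, the dot product above is a cosine similarity, so retrieval in either direction amounts to ranking rows or columns of this matrix.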

Cite

Text

Mafla et al. "StacMR: Scene-Text Aware Cross-Modal Retrieval." Winter Conference on Applications of Computer Vision, 2021.

Markdown

[Mafla et al. "StacMR: Scene-Text Aware Cross-Modal Retrieval." Winter Conference on Applications of Computer Vision, 2021.](https://mlanthology.org/wacv/2021/mafla2021wacv-stacmr/)

BibTeX

@inproceedings{mafla2021wacv-stacmr,
  title     = {{StacMR: Scene-Text Aware Cross-Modal Retrieval}},
  author    = {Mafla, Andres and Rezende, Rafael S. and Gomez, Lluis and Larlus, Diane and Karatzas, Dimosthenis},
  booktitle = {Winter Conference on Applications of Computer Vision},
  year      = {2021},
  pages     = {2220--2230},
  url       = {https://mlanthology.org/wacv/2021/mafla2021wacv-stacmr/}
}