Learning Semantic Relationship Among Instances for Image-Text Matching

Abstract

Image-text matching, a bridge connecting vision and language, is an important task that generally learns a holistic cross-modal embedding to achieve high-quality semantic alignment between the two modalities. However, previous studies focus only on capturing fragment-level relations within a sample from a single modality, e.g., salient regions in an image or words in a sentence, and pay little attention to instance-level interactions among samples across modalities, e.g., multiple images and texts. In this paper, we argue that modeling relations among samples helps the model learn subtle differences between hard negative instances and transfer shared knowledge to infrequent samples, both of which are promising for obtaining better holistic embeddings. We therefore propose a novel hierarchical relation modeling framework (HREM) that explicitly captures both fragment-level and instance-level relations to learn discriminative and robust cross-modal embeddings. Extensive experiments on Flickr30K and MS-COCO show that our method outperforms the state of the art by 4%-10% in terms of rSum.
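For context, the "hard negative instances" mentioned in the abstract are the closest non-matching samples in a training batch, which drive the hinge-based triplet ranking loss commonly used to learn holistic image-text embeddings. Below is a minimal, illustrative PyTorch sketch of that standard loss, not the HREM implementation itself; the function name and margin value are assumptions made for the example.

# Illustrative sketch (not HREM): hinge-based triplet ranking loss with
# hardest-negative mining, the usual objective for holistic image-text matching.
import torch
import torch.nn.functional as F

def triplet_loss_hard_negatives(img_emb, txt_emb, margin=0.2):
    """img_emb, txt_emb: (batch, dim) embeddings of matched image-text pairs."""
    img_emb = F.normalize(img_emb, dim=1)
    txt_emb = F.normalize(txt_emb, dim=1)
    scores = img_emb @ txt_emb.t()               # cosine similarity matrix
    pos = scores.diag().view(-1, 1)              # similarity of matched pairs

    # hinge cost of every negative against its positive, in both directions
    cost_txt = (margin + scores - pos).clamp(min=0)       # image -> wrong text
    cost_img = (margin + scores - pos.t()).clamp(min=0)   # text -> wrong image

    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_txt = cost_txt.masked_fill(mask, 0)
    cost_img = cost_img.masked_fill(mask, 0)

    # keep only the hardest (largest-violation) negative per anchor
    return cost_txt.max(dim=1)[0].sum() + cost_img.max(dim=0)[0].sum()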

Cite

Text

Fu et al. "Learning Semantic Relationship Among Instances for Image-Text Matching." Conference on Computer Vision and Pattern Recognition, 2023. doi:10.1109/CVPR52729.2023.01455

Markdown

[Fu et al. "Learning Semantic Relationship Among Instances for Image-Text Matching." Conference on Computer Vision and Pattern Recognition, 2023.](https://mlanthology.org/cvpr/2023/fu2023cvpr-learning-a/) doi:10.1109/CVPR52729.2023.01455

BibTeX

@inproceedings{fu2023cvpr-learning-a,
  title     = {{Learning Semantic Relationship Among Instances for Image-Text Matching}},
  author    = {Fu, Zheren and Mao, Zhendong and Song, Yan and Zhang, Yongdong},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2023},
  pages     = {15159--15168},
  doi       = {10.1109/CVPR52729.2023.01455},
  url       = {https://mlanthology.org/cvpr/2023/fu2023cvpr-learning-a/}
}