Context-Aware Attention Network for Image-Text Retrieval

Abstract

As a typical cross-modal problem, image-text bi-directional retrieval relies heavily on joint embedding learning and on the similarity measure for each image-text pair. It remains challenging because prior works seldom explore semantic correspondences between modalities and semantic correlations within a single modality at the same time. In this work, we propose a unified Context-Aware Attention Network (CAAN), which selectively focuses on critical local fragments (regions and words) by aggregating the global context. Specifically, it simultaneously utilizes global inter-modal alignments and intra-modal correlations to discover latent semantic relations. Considering the interactions between images and sentences in the retrieval process, intra-modal correlations are derived from the second-order attention of region-word alignments instead of directly comparing distances between the original features. Our method achieves fairly competitive results on two generic image-text retrieval datasets, Flickr30K and MS-COCO.
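The idea of deriving intra-modal correlations from region-word alignments can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no formulas, so the dot-product affinity, the row-wise softmax normalization, and the function name `context_aware_weights` are all assumptions made for illustration. The sketch only shows how a region-to-region correlation can be obtained by passing through the words (a second-order product of the two alignment matrices) rather than by comparing region features directly.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def matmul(a, b):
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def context_aware_weights(regions, words):
    """Hypothetical helper: inter-modal and second-order intra-modal attention.

    regions: list of R region feature vectors
    words:   list of W word feature vectors (same dimension as regions)
    """
    # Assumption: region-word affinity is a plain dot product (R x W).
    affinity = matmul(regions, transpose(words))

    # Inter-modal alignments, normalized in each direction.
    region_to_word = [softmax(row) for row in affinity]             # R x W
    word_to_region = [softmax(row) for row in transpose(affinity)]  # W x R

    # Second-order attention: two regions are correlated when they
    # align with the same words, so the path goes region -> word -> region.
    region_to_region = matmul(region_to_word, word_to_region)       # R x R
    return region_to_word, region_to_region


regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
words = [[1.0, 0.0], [0.0, 2.0]]
r2w, r2r = context_aware_weights(regions, words)
```

Because both alignment matrices are row-normalized, every row of `r2r` also sums to 1, so it can be read as an attention distribution over regions that depends on the sentence being compared.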

Cite

Text

Zhang et al. "Context-Aware Attention Network for Image-Text Retrieval." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020. doi:10.1109/CVPR42600.2020.00359

Markdown

[Zhang et al. "Context-Aware Attention Network for Image-Text Retrieval." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.](https://mlanthology.org/cvpr/2020/zhang2020cvpr-contextaware-a/) doi:10.1109/CVPR42600.2020.00359

BibTeX

@inproceedings{zhang2020cvpr-contextaware-a,
  title     = {{Context-Aware Attention Network for Image-Text Retrieval}},
  author    = {Zhang, Qi and Lei, Zhen and Zhang, Zhaoxiang and Li, Stan Z.},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2020},
  doi       = {10.1109/CVPR42600.2020.00359},
  url       = {https://mlanthology.org/cvpr/2020/zhang2020cvpr-contextaware-a/}
}