Context-Aware Attention Network for Image-Text Retrieval
Abstract
As a typical cross-modal problem, bi-directional image-text retrieval relies heavily on joint embedding learning and on the similarity measure for each image-text pair. It remains challenging because prior works seldom explore semantic correspondences between modalities and semantic correlations within a single modality at the same time. In this work, we propose a unified Context-Aware Attention Network (CAAN), which selectively focuses on critical local fragments (regions and words) by aggregating the global context. Specifically, it simultaneously utilizes global inter-modal alignments and intra-modal correlations to discover latent semantic relations. Considering the interactions between images and sentences in the retrieval process, intra-modal correlations are derived from the second-order attention of region-word alignments instead of intuitively comparing the distance between original features. Our method achieves fairly competitive results on two generic image-text retrieval datasets, Flickr30K and MS-COCO.
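The abstract does not give formulas, so the following is only a rough illustrative sketch of the idea it describes, not the authors' formulation. It assumes region and word features already live in a shared embedding space, uses cosine similarity for the region-word alignment, and forms "second-order" intra-modal correlations by comparing alignment profiles rather than raw features. The function name `context_aware_attention_sketch`, the mean-pooling of context, and the additive scoring are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def context_aware_attention_sketch(regions, words):
    """Hypothetical sketch: regions is (n_regions, d), words is (n_words, d)."""
    # L2-normalize so dot products are cosine similarities (an assumption).
    v = regions / np.linalg.norm(regions, axis=1, keepdims=True)
    w = words / np.linalg.norm(words, axis=1, keepdims=True)

    # First-order inter-modal alignment between every region and every word.
    align = v @ w.T  # (n_regions, n_words)

    # Second-order intra-modal correlations: two regions are related if they
    # attend to words in a similar way (and likewise for two words).
    region_to_word = softmax(align, axis=1)    # each row sums to 1
    word_to_region = softmax(align.T, axis=1)  # each row sums to 1
    region_corr = region_to_word @ region_to_word.T  # (n_regions, n_regions)
    word_corr = word_to_region @ word_to_region.T    # (n_words, n_words)

    # Aggregate global context into one score per fragment (assumed: means).
    region_score = align.mean(axis=1) + region_corr.mean(axis=1)
    word_score = align.mean(axis=0) + word_corr.mean(axis=1)

    # Attention weights over fragments, then attended global embeddings.
    region_attn = softmax(region_score, axis=0)
    word_attn = softmax(word_score, axis=0)
    image_vec = region_attn @ regions
    text_vec = word_attn @ words
    return image_vec, text_vec, region_attn, word_attn
```

In this sketch an image-text similarity could then be taken as the cosine similarity between `image_vec` and `text_vec`; the paper itself should be consulted for the actual attention and scoring functions.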
Cite
Text
Zhang et al. "Context-Aware Attention Network for Image-Text Retrieval." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020. doi:10.1109/CVPR42600.2020.00359
Markdown
[Zhang et al. "Context-Aware Attention Network for Image-Text Retrieval." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.](https://mlanthology.org/cvpr/2020/zhang2020cvpr-contextaware-a/) doi:10.1109/CVPR42600.2020.00359
BibTeX
@inproceedings{zhang2020cvpr-contextaware-a,
title = {{Context-Aware Attention Network for Image-Text Retrieval}},
author = {Zhang, Qi and Lei, Zhen and Zhang, Zhaoxiang and Li, Stan Z.},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year = {2020},
doi = {10.1109/CVPR42600.2020.00359},
url = {https://mlanthology.org/cvpr/2020/zhang2020cvpr-contextaware-a/}
}