Multi-Modality Cross Attention Network for Image and Sentence Matching

Abstract

The key to image and sentence matching is to accurately measure the visual-semantic similarity between an image and a sentence. However, most existing methods use either the intra-modality relationship within each modality or the inter-modality relationship between image regions and sentence words for the cross-modal matching task, but not both. In contrast, in this work we propose a novel Multi-Modality Cross Attention (MMCA) Network for image and sentence matching that jointly models the intra-modality and inter-modality relationships of image regions and sentence words in a unified deep model. In the proposed MMCA, we design a novel cross-attention mechanism that exploits not only the intra-modality relationship within each modality, but also the inter-modality relationship between image regions and sentence words, so that the two complement and enhance each other for image and sentence matching. Extensive experimental results on two standard benchmarks, Flickr30K and MS-COCO, demonstrate that the proposed model performs favorably against state-of-the-art image and sentence matching methods.
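
The abstract does not spell out the cross-attention mechanism, so the following is only a minimal sketch of one way intra-modality and inter-modality relationships could be modeled together: region and word features are concatenated into a single sequence and passed through plain scaled dot-product attention, so every token attends both within its own modality and across to the other. The function names (`joint_attention`, `cosine_similarity`), the feature dimensions, the random inputs, the mean pooling, and the absence of learned projections are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def joint_attention(regions, words):
    """Scaled dot-product attention over the concatenation of both modalities.

    regions: (R, d) image-region features; words: (W, d) word features.
    Hypothetical helper: no learned query/key/value projections are used.
    """
    tokens = np.concatenate([regions, words], axis=0)  # (R + W, d)
    d = tokens.shape[1]
    # Each row mixes intra-modality and inter-modality attention weights.
    weights = softmax(tokens @ tokens.T / np.sqrt(d), axis=-1)  # (R + W, R + W)
    attended = weights @ tokens  # (R + W, d)
    n_regions = regions.shape[0]
    return attended[:n_regions], attended[n_regions:], weights


def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Random stand-ins for real features: 5 regions and 7 words, 8 dimensions each.
rng = np.random.default_rng(0)
regions = rng.normal(size=(5, 8))
words = rng.normal(size=(7, 8))

att_regions, att_words, weights = joint_attention(regions, words)

# Mean-pool each modality and compare the pooled vectors (an assumed scoring rule).
score = cosine_similarity(att_regions.mean(axis=0), att_words.mean(axis=0))
print(f"image-sentence similarity: {score:.4f}")
```

In the attention matrix `weights`, the top-left and bottom-right blocks hold the intra-modality weights (region to region, word to word), while the off-diagonal blocks hold the inter-modality weights (region to word and word to region).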

Cite

Text

Wei et al. "Multi-Modality Cross Attention Network for Image and Sentence Matching." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020. doi:10.1109/CVPR42600.2020.01095

Markdown

[Wei et al. "Multi-Modality Cross Attention Network for Image and Sentence Matching." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.](https://mlanthology.org/cvpr/2020/wei2020cvpr-multimodality/) doi:10.1109/CVPR42600.2020.01095

BibTeX

@inproceedings{wei2020cvpr-multimodality,
  title     = {{Multi-Modality Cross Attention Network for Image and Sentence Matching}},
  author    = {Wei, Xi and Zhang, Tianzhu and Li, Yan and Zhang, Yongdong and Wu, Feng},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2020},
  doi       = {10.1109/CVPR42600.2020.01095},
  url       = {https://mlanthology.org/cvpr/2020/wei2020cvpr-multimodality/}
}