More than Just Attention: Improving Cross-Modal Attentions with Contrastive Constraints for Image-Text Matching

Chen, Yuxiao; Yuan, Jianbo; Zhao, Long; Chen, Tianlang; Luo, Rui; Davis, Larry; Metaxas, Dimitris N.

WACV 2023 pp. 4432-4440

Abstract

Cross-modal attention mechanisms have been widely applied to the image-text matching task and have achieved remarkable improvements thanks to their capability of learning fine-grained relevance across different modalities. However, the cross-modal attention models of existing methods can be sub-optimal and inaccurate because no direct supervision is provided during training. In this work, we propose two novel training strategies, namely the Contrastive Content Re-sourcing (CCR) and Contrastive Content Swapping (CCS) constraints, to address these limitations. These constraints supervise the training of cross-modal attention models in a contrastive learning manner without requiring explicit attention annotations. They are plug-in training strategies and can be integrated into existing cross-modal attention models. Additionally, we introduce three metrics, Attention Precision, Recall, and F1-Score, to quantitatively measure the quality of learned attention models. We evaluate the proposed constraints by incorporating them into four state-of-the-art cross-modal attention-based image-text matching models. Experimental results on both the Flickr30k and MS-COCO datasets demonstrate that integrating these constraints generally improves both retrieval performance and attention metrics.
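The abstract names the three attention metrics but does not define them. As a rough illustration only, the sketch below applies the standard set-based definitions of precision, recall, and F1 to attention: regions whose attention weight reaches a threshold count as "attended" and are compared with a reference set of relevant regions. The thresholding step, the function names `attended_regions` and `attention_prf`, and the example weights are all assumptions made for this sketch; the paper's exact definitions may differ.

```python
def attended_regions(weights, threshold):
    """Indices of regions whose attention weight is at least `threshold`.

    Thresholding is an assumption of this sketch, not taken from the paper.
    """
    return {i for i, w in enumerate(weights) if w >= threshold}


def attention_prf(attended, relevant):
    """Standard set-based precision, recall, and F1.

    attended: indices of regions the model attends to (hypothetical input).
    relevant: indices of regions considered truly relevant (hypothetical input).
    """
    attended, relevant = set(attended), set(relevant)
    hits = len(attended & relevant)
    precision = hits / len(attended) if attended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative values only: four image regions attended by one word.
weights = [0.5, 0.3, 0.1, 0.1]
attended = attended_regions(weights, threshold=0.2)   # regions 0 and 1
precision, recall, f1 = attention_prf(attended, relevant={1, 2})
print(precision, recall, f1)
```

In this toy case one of the two attended regions is relevant and one of the two relevant regions is attended, so precision and recall are both one half.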

Cite

Text

Chen et al. "More than Just Attention: Improving Cross-Modal Attentions with Contrastive Constraints for Image-Text Matching." Winter Conference on Applications of Computer Vision, 2023.

Markdown

[Chen et al. "More than Just Attention: Improving Cross-Modal Attentions with Contrastive Constraints for Image-Text Matching." Winter Conference on Applications of Computer Vision, 2023.](https://mlanthology.org/wacv/2023/chen2023wacv-more-a/)

BibTeX

@inproceedings{chen2023wacv-more-a,
  title     = {{More than Just Attention: Improving Cross-Modal Attentions with Contrastive Constraints for Image-Text Matching}},
  author    = {Chen, Yuxiao and Yuan, Jianbo and Zhao, Long and Chen, Tianlang and Luo, Rui and Davis, Larry and Metaxas, Dimitris N.},
  booktitle = {Winter Conference on Applications of Computer Vision},
  year      = {2023},
  pages     = {4432-4440},
  url       = {https://mlanthology.org/wacv/2023/chen2023wacv-more-a/}
}