More than Just Attention: Improving Cross-Modal Attentions with Contrastive Constraints for Image-Text Matching

Abstract

Cross-modal attention mechanisms have been widely applied to the image-text matching task and have achieved remarkable improvements thanks to their capability of learning fine-grained relevance across different modalities. However, the cross-modal attention models of existing methods can be sub-optimal or inaccurate because no direct supervision is provided during training. In this work, we propose two novel training strategies, namely the Contrastive Content Re-sourcing (CCR) and Contrastive Content Swapping (CCS) constraints, to address these limitations. The constraints supervise the training of cross-modal attention models in a contrastive learning manner without requiring explicit attention annotations. They are plug-in training strategies and can be integrated into existing cross-modal attention models. Additionally, we introduce three metrics, Attention Precision, Attention Recall, and Attention F1-Score, to quantitatively measure the quality of learned attention models. We evaluate the proposed constraints by incorporating them into four state-of-the-art cross-modal attention-based image-text matching models. Experimental results on both the Flickr30k and MS-COCO datasets demonstrate that integrating these constraints generally improves both retrieval performance and the attention metrics.
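The abstract names the three attention metrics without defining them. As a rough illustration only, the sketch below assumes the usual set-based definitions of precision, recall, and F1, applied to a hypothetical set of image-region indices that a word attends to and a hypothetical set of regions annotated as relevant to that word. The function name `attention_prf`, the region indices, and the way the "attended" set is obtained are all assumptions; the paper's exact definitions may differ.

```python
def attention_prf(attended, relevant):
    """Set-based precision, recall, and F1 for one word's attention.

    attended: region indices the model attends to (assumed to be chosen
              upstream, e.g. by thresholding attention weights).
    relevant: region indices annotated as matching the word.
    Both inputs are hypothetical; this is not the paper's reference code.
    """
    attended, relevant = set(attended), set(relevant)
    hits = len(attended & relevant)  # attended regions that are relevant

    # Guard against empty sets to avoid division by zero.
    precision = hits / len(attended) if attended else 0.0
    recall = hits / len(relevant) if relevant else 0.0

    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Example: the word attends to regions 0-3; regions 1, 2, and 5 are relevant.
p, r, f1 = attention_prf({0, 1, 2, 3}, {1, 2, 5})
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Under these assumptions, a model that spreads attention over many irrelevant regions scores low precision, while one that misses relevant regions scores low recall, which is the kind of failure the proposed constraints aim to reduce.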

Cite

Text

Chen et al. "More than Just Attention: Improving Cross-Modal Attentions with Contrastive Constraints for Image-Text Matching." Winter Conference on Applications of Computer Vision, 2023.

Markdown

[Chen et al. "More than Just Attention: Improving Cross-Modal Attentions with Contrastive Constraints for Image-Text Matching." Winter Conference on Applications of Computer Vision, 2023.](https://mlanthology.org/wacv/2023/chen2023wacv-more-a/)

BibTeX

@inproceedings{chen2023wacv-more-a,
  title     = {{More than Just Attention: Improving Cross-Modal Attentions with Contrastive Constraints for Image-Text Matching}},
  author    = {Chen, Yuxiao and Yuan, Jianbo and Zhao, Long and Chen, Tianlang and Luo, Rui and Davis, Larry and Metaxas, Dimitris N.},
  booktitle = {Winter Conference on Applications of Computer Vision},
  year      = {2023},
  pages     = {4432--4440},
  url       = {https://mlanthology.org/wacv/2023/chen2023wacv-more-a/}
}