Universal Weighting Metric Learning for Cross-Modal Matching

Abstract

Cross-modal matching has been a prominent research topic in both the vision and language communities. Learning an appropriate mining strategy to sample and weight informative pairs is crucial for cross-modal matching performance. However, most existing metric learning methods are developed for unimodal matching and are therefore unsuitable for cross-modal matching on multimodal data with heterogeneous features. To address this problem, we propose a simple and interpretable universal weighting framework for cross-modal matching, which provides a tool to analyze the interpretability of various loss functions. Furthermore, we introduce a new polynomial loss under the universal weighting framework, which defines separate weight functions for positive and negative informative pairs. Experimental results on two image-text matching benchmarks and two video-text matching benchmarks validate the efficacy of the proposed method.
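
The abstract does not state the form of the polynomial loss, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes that a pair's weight is the magnitude of the derivative of the loss with respect to that pair's similarity (a common reading in pair-weighting metric learning), that matched image-text pairs lie on the diagonal of a similarity matrix, and that only the hardest negative per row is mined. The function names and the example coefficients are hypothetical.

```python
def poly(coeffs, x):
    """Evaluate sum(coeffs[k] * x**k) using Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result


def poly_derivative(coeffs):
    """Coefficients of the derivative of the polynomial given by coeffs."""
    return [k * c for k, c in enumerate(coeffs)][1:]


def hardest_negative(sim, i):
    """Largest off-diagonal similarity in row i (assumed mining rule)."""
    return max(s for j, s in enumerate(sim[i]) if j != i)


def polynomial_pair_loss(sim, pos_coeffs, neg_coeffs):
    """Mean over rows of P_pos(positive sim) + P_neg(hardest negative sim)."""
    total = 0.0
    for i in range(len(sim)):
        total += poly(pos_coeffs, sim[i][i])
        total += poly(neg_coeffs, hardest_negative(sim, i))
    return total / len(sim)


def pair_weights(sim, pos_coeffs, neg_coeffs):
    """Per-row (positive weight, negative weight), taken as |dLoss/dSim|."""
    d_pos = poly_derivative(pos_coeffs)
    d_neg = poly_derivative(neg_coeffs)
    return [
        (abs(poly(d_pos, sim[i][i])), abs(poly(d_neg, hardest_negative(sim, i))))
        for i in range(len(sim))
    ]


# Hypothetical coefficients: (1 - s)^2 for positives, s^2 for negatives.
POS_COEFFS = [1.0, -2.0, 1.0]
NEG_COEFFS = [0.0, 0.0, 1.0]

# Toy similarity matrix: rows are images, columns are texts.
sim = [[0.9, 0.2],
       [0.5, 0.6]]

loss = polynomial_pair_loss(sim, POS_COEFFS, NEG_COEFFS)
weights = pair_weights(sim, POS_COEFFS, NEG_COEFFS)
```

With these example coefficients, a positive pair with low similarity and a negative pair with high similarity both receive larger weights, which is the kind of behavior a weighting scheme for informative pairs is meant to produce. In the toy matrix, the second row (a weaker positive and a harder negative) is weighted more heavily than the first.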

Cite

Text

Wei et al. "Universal Weighting Metric Learning for Cross-Modal Matching." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020. doi:10.1109/CVPR42600.2020.01302

Markdown

[Wei et al. "Universal Weighting Metric Learning for Cross-Modal Matching." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.](https://mlanthology.org/cvpr/2020/wei2020cvpr-universal/) doi:10.1109/CVPR42600.2020.01302

BibTeX

@inproceedings{wei2020cvpr-universal,
  title     = {{Universal Weighting Metric Learning for Cross-Modal Matching}},
  author    = {Wei, Jiwei and Xu, Xing and Yang, Yang and Ji, Yanli and Wang, Zheng and Shen, Heng Tao},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2020},
  doi       = {10.1109/CVPR42600.2020.01302},
  url       = {https://mlanthology.org/cvpr/2020/wei2020cvpr-universal/}
}