RANKCLIP: Ranking-Consistent Language-Image Pretraining

Abstract

Self-supervised contrastive learning models, such as CLIP, have set new benchmarks for vision-language models in many downstream tasks. However, their dependency on rigid one-to-one mappings overlooks the complex and often multifaceted relationships between and within texts and images. To this end, we introduce RankCLIP, a novel pretraining method that extends beyond the rigid one-to-one matching framework of CLIP and its variants. By extending the traditional pair-wise loss to a list-wise formulation, and leveraging both in-modal and cross-modal ranking consistency, RankCLIP improves the alignment process, enabling it to capture the nuanced many-to-many relationships within and across modalities. Through comprehensive experiments, we demonstrate the effectiveness of RankCLIP on various downstream tasks, notably achieving significant gains in zero-shot classification over state-of-the-art methods, underscoring the importance of this enhanced learning process.
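The abstract describes the objective only at a high level. Below is a minimal, illustrative PyTorch sketch of how a list-wise ranking-consistency term could be combined with the standard pair-wise CLIP loss. The function names, the ListNet-style KL formulation, and the weighting hyperparameter `alpha` are assumptions made for illustration, not the authors' exact RankCLIP formulation.

```python
import torch
import torch.nn.functional as F

# Illustrative sketch only: not the authors' exact RankCLIP objective.

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Standard pair-wise CLIP objective: symmetric cross-entropy over matched pairs."""
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def listwise_consistency(sim_a, sim_b, temperature=0.07):
    """ListNet-style agreement: KL divergence between the softmax distributions
    induced by two similarity lists over the same candidates."""
    p = F.softmax(sim_a / temperature, dim=-1)
    log_q = F.log_softmax(sim_b / temperature, dim=-1)
    return F.kl_div(log_q, p, reduction="batchmean")

def rank_consistent_loss(image_emb, text_emb, alpha=0.5, temperature=0.07):
    """Pair-wise CLIP loss plus list-wise in-modal and cross-modal ranking consistency."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    sim_i2i = image_emb @ image_emb.t()   # in-modal (image-image) similarities
    sim_t2t = text_emb @ text_emb.t()     # in-modal (text-text) similarities
    sim_i2t = image_emb @ text_emb.t()    # cross-modal similarities
    sim_t2i = sim_i2t.t()

    # In-modal consistency: image-image rankings should agree with text-text rankings.
    in_modal = listwise_consistency(sim_i2i, sim_t2t, temperature)
    # Cross-modal consistency: image-to-text rankings should agree with text-to-image rankings.
    cross_modal = listwise_consistency(sim_i2t, sim_t2i, temperature)

    return clip_loss(image_emb, text_emb, temperature) + alpha * (in_modal + cross_modal)
```

In this sketch, each similarity row is treated as a ranked list over the batch, and consistency is encouraged by matching the softmax distributions those lists induce; the relative weight of the ranking terms (here `alpha`) would in practice be a tuned hyperparameter.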

Cite

Text

Zhang et al. "RANKCLIP: Ranking-Consistent Language-Image Pretraining." International Conference on Computer Vision, 2025.

Markdown

[Zhang et al. "RANKCLIP: Ranking-Consistent Language-Image Pretraining." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/zhang2025iccv-rankclip/)

BibTeX

@inproceedings{zhang2025iccv-rankclip,
  title     = {{RANKCLIP: Ranking-Consistent Language-Image Pretraining}},
  author    = {Zhang, Yiming and Zhao, Zhuokai and Chen, Zhaorun and Feng, Zhili and Ding, Zenghui and Sun, Yining},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {3874--3884},
  url       = {https://mlanthology.org/iccv/2025/zhang2025iccv-rankclip/}
}