Meta CLIP 2: A Worldwide Scaling Recipe

Abstract

Contrastive Language-Image Pretraining (CLIP) is a popular foundation model that supports tasks ranging from zero-shot classification and retrieval to serving as an encoder for multimodal large language models (MLLMs). Although CLIP has been trained successfully on billion-scale image-text pairs from the English world, scaling its training further to learn from worldwide web data remains challenging: (1) no curation method is available to handle data points from the non-English world; (2) existing multilingual CLIP models perform worse on English than their English-only counterparts, i.e., the "curse of multilinguality" that is common in LLMs. Here, we present Meta CLIP 2, the first recipe for training CLIP from scratch on worldwide web-scale image-text pairs. To generalize our findings, we conduct rigorous ablations with the minimal changes necessary to address the above challenges, and present a recipe that enables mutual benefits between English and non-English world data. In zero-shot ImageNet classification, Meta CLIP 2 ViT-H/14 surpasses its English-only counterpart by 0.8% and mSigLIP by 0.7%, and surprisingly sets a new state of the art without system-level confounding factors (e.g., translation, bespoke architecture changes) on multilingual benchmarks, such as CVQA with 57.4%, Babel-ImageNet with 50.2%, and XM3600 with 64.3% on image-to-text retrieval. Code and model are available at https://github.com/facebookresearch/MetaCLIP.

Cite

Text

Chuang et al. "Meta CLIP 2: A Worldwide Scaling Recipe." Advances in Neural Information Processing Systems, 2025.

Markdown

[Chuang et al. "Meta CLIP 2: A Worldwide Scaling Recipe." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/chuang2025neurips-meta/)

BibTeX

@inproceedings{chuang2025neurips-meta,
  title     = {{Meta CLIP 2: A Worldwide Scaling Recipe}},
  author    = {Chuang, Yung-Sung and Li, Yang and Wang, Dong and Yeh, Ching-Feng and Lyu, Kehan and Raghavendra, Ramya and Glass, James R. and Huang, Lifei and Weston, Jason E. and Zettlemoyer, Luke and Chen, Xinlei and Liu, Zhuang and Xie, Saining and Yih, Wen-tau and Li, Shang-Wen and Xu, Hu},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/chuang2025neurips-meta/}
}