Contrastive Localized Language-Image Pre-Training

Abstract

CLIP has been a celebrated method for training vision encoders to generate image/text representations that facilitate various applications. Recently, it has been widely adopted as the vision backbone of multimodal large language models (MLLMs). The success of CLIP relies on aligning noisy, web-crawled text annotations with images at the image level. However, such a criterion may be insufficient for downstream tasks that require fine-grained vision representations, especially when region-level understanding is demanded by MLLMs. We improve the localization capability of CLIP with several advances. Our proposed pre-training method, Contrastive Localized Language-Image Pre-training (CLOC), complements CLIP with a region-text contrastive loss and accompanying modules. We formulate a new concept, promptable embeddings, in which the encoder produces image embeddings that are easy to transform into region representations given spatial hints. To support large-scale pre-training, we design a visually-enriched and spatially-localized captioning framework to effectively generate region-text labels. By scaling up to billions of annotated images, CLOC enables high-quality regional embeddings for recognition and retrieval tasks, and can serve as a drop-in replacement for CLIP to enhance MLLMs, especially on referring and grounding tasks.
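To make the abstract's two key ideas concrete, below is a minimal sketch (not the authors' released code) of a region-text contrastive step: a hypothetical `Prompter` module pools patch-level image embeddings into a region embedding given a spatial hint (a box mask), and a CLIP-style symmetric InfoNCE loss aligns region embeddings with region-caption embeddings. The module name, pooling scheme, and tensor shapes are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's implementation) of a
# region-text contrastive step in PyTorch.
import torch
import torch.nn.functional as F


class Prompter(torch.nn.Module):
    """Hypothetical lightweight module: pools patch embeddings into a region
    embedding conditioned on a spatial hint (here, a binary box mask)."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = torch.nn.Linear(dim, dim)

    def forward(self, patch_emb: torch.Tensor, box_mask: torch.Tensor) -> torch.Tensor:
        # patch_emb: (B, N, D) patch embeddings; box_mask: (B, N), 1 inside the box.
        w = box_mask / box_mask.sum(dim=1, keepdim=True).clamp(min=1e-6)
        pooled = torch.einsum("bn,bnd->bd", w, patch_emb)  # masked average pooling
        return F.normalize(self.proj(pooled), dim=-1)


def region_text_contrastive_loss(region_emb: torch.Tensor,
                                 text_emb: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over matched (region, region-caption) pairs,
    analogous to CLIP's image-text loss but applied at the region level."""
    logits = region_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


# Usage with random tensors standing in for encoder outputs.
B, N, D = 8, 196, 512
patch_emb = F.normalize(torch.randn(B, N, D), dim=-1)     # image-encoder patches
text_emb = F.normalize(torch.randn(B, D), dim=-1)         # region-caption embeddings
box_mask = (torch.rand(B, N) > 0.7).float()                # spatial hints (boxes)
prompter = Prompter(D)
loss = region_text_contrastive_loss(prompter(patch_emb, box_mask), text_emb)
```

In this reading, "promptable embeddings" means the image encoder's output is kept generic, and only the cheap prompting step (the pooling above) is region-specific, so the same features can be reused for many boxes.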

Cite

Text

Chen et al. "Contrastive Localized Language-Image Pre-Training." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown

[Chen et al. "Contrastive Localized Language-Image Pre-Training." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/chen2025icml-contrastive/)

BibTeX

@inproceedings{chen2025icml-contrastive,
  title     = {{Contrastive Localized Language-Image Pre-Training}},
  author    = {Chen, Hong-You and Lai, Zhengfeng and Zhang, Haotian and Wang, Xinze and Eichner, Marcin and You, Keen and Cao, Meng and Zhang, Bowen and Yang, Yinfei and Gan, Zhe},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
  pages     = {8386--8402},
  volume    = {267},
  url       = {https://mlanthology.org/icml/2025/chen2025icml-contrastive/}
}