Multi-Modal Contrastive Learning Adapts to Intrinsic Dimensions of Shared Latent Variables

Abstract

Multi-modal contrastive learning, a self-supervised representation learning technique, has achieved great success in foundation model training, most notably CLIP (Radford et al., 2021). In this paper, we study the theoretical properties of the representations learned by multi-modal contrastive learning, going beyond linear representations and specific data distributions. Our analysis reveals that, enabled by temperature optimization, multi-modal contrastive learning not only maximizes the mutual information between modalities but also adapts to the intrinsic dimension of the data, which can be much lower than the user-specified dimension of the representation vectors. Experiments on both synthetic and real-world datasets demonstrate the ability of contrastive learning to learn low-dimensional yet informative representations, bridging theoretical insights and practical performance.
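
To make the mechanism concrete, below is a minimal PyTorch sketch of a CLIP-style symmetric contrastive (InfoNCE) loss with a learnable temperature, the quantity whose optimization the abstract credits with the dimension-adaptation behavior. The function name `clip_contrastive_loss` and the toy usage are illustrative assumptions, not the paper's exact training setup.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(z_x, z_y, log_scale):
    """Symmetric InfoNCE loss on a batch of paired embeddings.

    z_x, z_y : (batch, dim) embeddings from the two modalities.
    log_scale: learnable scalar; exp(log_scale) is the inverse
               temperature, optimized jointly with the encoders.
    """
    # Normalize so that inner products are cosine similarities.
    z_x = F.normalize(z_x, dim=-1)
    z_y = F.normalize(z_y, dim=-1)

    # Pairwise similarity matrix, scaled by the learned temperature.
    logits = z_x @ z_y.t() * log_scale.exp()

    # Matched pairs sit on the diagonal of the similarity matrix.
    targets = torch.arange(z_x.size(0), device=z_x.device)

    # Cross-entropy in both directions (x -> y and y -> x).
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# Toy usage with random embeddings; CLIP initializes the scale at ln(1/0.07).
log_scale = torch.nn.Parameter(torch.tensor(2.6593))
loss = clip_contrastive_loss(torch.randn(32, 128), torch.randn(32, 128), log_scale)
```

Treating the temperature as a trainable parameter rather than a fixed hyperparameter is the design choice the abstract highlights: it lets the sharpness of the softmax adjust during training, which the paper connects to adaptation to the intrinsic dimension of the shared latent variables.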

Cite

Text

Gui et al. "Multi-Modal Contrastive Learning Adapts to Intrinsic Dimensions of Shared Latent Variables." Advances in Neural Information Processing Systems, 2025.

Markdown

[Gui et al. "Multi-Modal Contrastive Learning Adapts to Intrinsic Dimensions of Shared Latent Variables." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/gui2025neurips-multimodal/)

BibTeX

@inproceedings{gui2025neurips-multimodal,
  title     = {{Multi-Modal Contrastive Learning Adapts to Intrinsic Dimensions of Shared Latent Variables}},
  author    = {Gui, Yu and Ma, Cong and Ma, Zongming},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/gui2025neurips-multimodal/}
}