On the Comparison Between Multi-Modal and Single-Modal Contrastive Learning

Abstract

Multi-modal contrastive learning with language supervision represents a paradigm shift in modern machine learning. By pre-training on web-scale datasets, it learns high-quality representations that exhibit impressive robustness and transferability. Despite this empirical success, its theoretical understanding is still in its infancy, especially regarding how it compares with single-modal contrastive learning. In this work, we introduce a feature learning theory framework that provides a theoretical foundation for understanding the differences between multi-modal and single-modal contrastive learning. Based on a data generation model consisting of signal and noise, we analyze a ReLU network trained with the InfoMax objective function. Through a trajectory-based optimization analysis and a characterization of generalization on downstream tasks, we identify the signal-to-noise ratio (SNR) as the critical factor affecting downstream generalization in both multi-modal and single-modal contrastive learning. Through the cooperation between the two modalities, multi-modal learning achieves better feature learning, which improves downstream performance compared with single-modal learning. Our analysis provides a unified framework that characterizes the optimization and generalization of both single-modal and multi-modal contrastive learning. Experiments on both synthetic and real-world datasets further support our theoretical findings.

Cite

Text

Huang et al. "On the Comparison Between Multi-Modal and Single-Modal Contrastive Learning." Neural Information Processing Systems, 2024. doi:10.52202/079017-2592

Markdown

[Huang et al. "On the Comparison Between Multi-Modal and Single-Modal Contrastive Learning." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/huang2024neurips-comparison/) doi:10.52202/079017-2592

BibTeX

@inproceedings{huang2024neurips-comparison,
  title     = {{On the Comparison Between Multi-Modal and Single-Modal Contrastive Learning}},
  author    = {Huang, Wei and Han, Andi and Chen, Yongqiang and Cao, Yuan and Xu, Zhiqiang and Suzuki, Taiji},
  booktitle = {Neural Information Processing Systems},
  year      = {2024},
  doi       = {10.52202/079017-2592},
  url       = {https://mlanthology.org/neurips/2024/huang2024neurips-comparison/}
}