Context-Aware Multimodal Pretraining

Abstract

Large-scale multimodal representation learning successfully optimizes for zero-shot transfer at test time. Yet the standard pretraining paradigm (contrastive learning on large amounts of image-text data) does not explicitly encourage representations to support few-shot adaptation. In this work, we propose a simple but carefully designed extension to multimodal pretraining that enables representations to accommodate additional context. Using this objective, we show that vision-language models can be trained to exhibit significantly increased few-shot adaptation: across 21 downstream tasks, we find up to four-fold improvements in test-time sample efficiency and average few-shot adaptation gains of over 5%, while retaining zero-shot generalization performance across model scales and training durations. In particular, equipped with simple, training-free, metric-based adaptation mechanisms, our representations surpass significantly more complex optimization-based adaptation schemes.
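
The abstract does not spell out the adaptation mechanism, but "training-free, metric-based adaptation" is commonly instantiated as a nearest-class-centroid (prototype) classifier built on frozen embeddings. The sketch below illustrates that generic idea only; it is not the paper's specific method, and all function names, shapes, and the use of NumPy are illustrative assumptions.

# Minimal sketch (assumptions, not the paper's exact mechanism) of
# training-free, metric-based few-shot adaptation: a nearest-class-centroid
# (prototype) classifier over frozen image embeddings.
import numpy as np


def l2_normalize(x: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Project embeddings onto the unit sphere."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def build_prototypes(support_embeds: np.ndarray,
                     support_labels: np.ndarray,
                     num_classes: int) -> np.ndarray:
    """Average the frozen support embeddings of each class into a prototype."""
    dim = support_embeds.shape[-1]
    prototypes = np.zeros((num_classes, dim), dtype=support_embeds.dtype)
    for c in range(num_classes):
        prototypes[c] = support_embeds[support_labels == c].mean(axis=0)
    return l2_normalize(prototypes)


def predict(query_embeds: np.ndarray, prototypes: np.ndarray) -> np.ndarray:
    """Assign each query to the prototype with the highest cosine similarity."""
    sims = l2_normalize(query_embeds) @ prototypes.T  # (num_queries, num_classes)
    return sims.argmax(axis=-1)


if __name__ == "__main__":
    # Toy example with random "embeddings"; in practice the support and query
    # vectors would come from the pretrained vision encoder, kept frozen.
    rng = np.random.default_rng(0)
    dim, num_classes, shots = 512, 5, 4
    support = rng.normal(size=(num_classes * shots, dim)).astype(np.float32)
    labels = np.repeat(np.arange(num_classes), shots)
    queries = rng.normal(size=(10, dim)).astype(np.float32)

    protos = build_prototypes(l2_normalize(support), labels, num_classes)
    print(predict(queries, protos))

Because no parameters are updated, this kind of adaptation only works well if the pretrained embedding space already clusters by class given a handful of labeled examples, which is the property the context-aware pretraining objective is designed to encourage.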

Cite

Text

Roth et al. "Context-Aware Multimodal Pretraining." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.00403

Markdown

[Roth et al. "Context-Aware Multimodal Pretraining." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/roth2025cvpr-contextaware/) doi:10.1109/CVPR52734.2025.00403

BibTeX

@inproceedings{roth2025cvpr-contextaware,
  title     = {{Context-Aware Multimodal Pretraining}},
  author    = {Roth, Karsten and Akata, Zeynep and Damen, Dima and Balazevic, Ivana and Henaff, Olivier J.},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {4267--4279},
  doi       = {10.1109/CVPR52734.2025.00403},
  url       = {https://mlanthology.org/cvpr/2025/roth2025cvpr-contextaware/}
}