VaMP: Variational Multi-Modal Prompt Learning for Vision-Language Models

NeurIPS 2025

Abstract

Vision-language models (VLMs), such as CLIP, have shown strong generalization under zero-shot settings, yet adapting them to downstream tasks with limited supervision remains a significant challenge. Existing multi-modal prompt learning methods typically rely on fixed, shared prompts and deterministic parameters, which limits their ability to capture instance-level variation or model uncertainty across diverse tasks and domains. To tackle this issue, we propose a novel Variational Multi-Modal Prompt Learning (VaMP) framework that enables sample-specific, uncertainty-aware prompt tuning in multi-modal representation learning. VaMP generates instance-conditioned prompts by sampling from a learned posterior distribution, allowing the model to personalize its behavior based on input content. To further enhance the integration of local and global semantics, we introduce a class-aware prior derived from the instance representation and class prototype. Building upon these, we formulate prompt tuning as variational inference over latent prompt representations and train the entire framework end-to-end through reparameterized sampling. Experiments on few-shot and domain generalization benchmarks show that VaMP achieves state-of-the-art performance, highlighting the benefits of modeling both uncertainty and task structure in our method. Project page: https://visual-ai.github.io/vamp
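
The abstract's variational formulation can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes diagonal Gaussian forms for both the posterior over latent prompts and the class-aware prior, which the abstract does not specify, and every name in it (`reparameterize`, `kl_diag_gaussians`, `negative_elbo`, `beta`) is hypothetical. It shows only the two generic ingredients the abstract mentions: reparameterized sampling, and a KL term that pulls the posterior toward a prior.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1) per dimension.

    Writing the sample this way keeps it a deterministic function of
    (mu, log_var), which is what lets gradients flow in a real framework.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_diag_gaussians(mu_q, log_var_q, mu_p, log_var_p):
    """KL(q || p) between two diagonal Gaussians, summed over dimensions."""
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, log_var_q, mu_p, log_var_p):
        total += 0.5 * (lp - lq
                        + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp)
                        - 1.0)
    return total


def negative_elbo(task_loss, kl, beta=1.0):
    """Task loss plus a weighted KL term (beta is an assumed hyperparameter)."""
    return task_loss + beta * kl


# Toy usage with made-up numbers: a 2-dimensional latent prompt.
rng = random.Random(0)
mu_q, log_var_q = [0.2, -0.1], [-1.0, -1.0]   # assumed posterior parameters
mu_p, log_var_p = [0.0, 0.0], [0.0, 0.0]      # assumed class-aware prior parameters
z = reparameterize(mu_q, log_var_q, rng)       # sampled latent prompt
loss = negative_elbo(task_loss=1.5, kl=kl_diag_gaussians(mu_q, log_var_q, mu_p, log_var_p))
```

In an actual model the posterior and prior parameters would come from learned networks conditioned on the instance and, for the prior, the class prototype; here they are fixed lists so the arithmetic stays visible.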

Cite

Text

Cheng and Han. "VaMP: Variational Multi-Modal Prompt Learning for Vision-Language Models." Advances in Neural Information Processing Systems, 2025.

Markdown

[Cheng and Han. "VaMP: Variational Multi-Modal Prompt Learning for Vision-Language Models." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/cheng2025neurips-vamp/)

BibTeX

@inproceedings{cheng2025neurips-vamp,
  title     = {{VaMP: Variational Multi-Modal Prompt Learning for Vision-Language Models}},
  author    = {Cheng, Silin and Han, Kai},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/cheng2025neurips-vamp/}
}