Retaining and Enhancing Pre-Trained Knowledge in Vision-Language Models with Prompt Ensembling

Abstract

The advancement of vision-language models, particularly the Contrastive Language-Image Pre-training (CLIP) model, has revolutionized the field of machine learning by enabling robust zero-shot learning capabilities. These capabilities allow models to understand and respond to previously unseen data without task-specific training. However, adapting CLIP to integrate specialized knowledge from various domains while retaining its zero-shot capabilities remains a significant challenge. To address this, we introduce a novel prompt ensemble learning approach called Group-wise Prompt Ensemble (GPE). This method aims to enhance CLIP's zero-shot capabilities by incorporating new domain knowledge while improving its adaptability and robustness against data distribution shifts. Our approach hinges on three main strategies: prompt grouping with masked attention to optimize CLIP's adaptability while safeguarding its zero-shot capabilities; the incorporation of auxiliary prompts for the seamless integration of new domain insights without disrupting the original model's representation; and an ensemble learning strategy that effectively merges original and new knowledge. Through rigorous experimentation, including more challenging cross-dataset transfer evaluations, our GPE method redefines the benchmarks for the adaptability and efficiency of vision-language models, surpassing existing models across various scenarios.
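The prompt-grouping idea above can be illustrated with a minimal sketch: a block-diagonal attention mask keeps each prompt group isolated, so newly learned prompts cannot interfere with the representations of the original (zero-shot) prompts. This is only an assumption-based toy illustration with NumPy, not the authors' implementation; the function names, shapes, and group sizes here are hypothetical.

```python
import numpy as np

def masked_attention(queries, keys, values, mask):
    # Scaled dot-product attention with an additive mask:
    # mask[i, j] = 0 where attention is allowed, -inf where it is blocked.
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d) + mask
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values

def group_mask(group_sizes):
    # Block-diagonal mask: tokens attend only within their own group,
    # so one prompt group cannot alter another group's outputs.
    n = sum(group_sizes)
    mask = np.full((n, n), -np.inf)
    start = 0
    for g in group_sizes:
        mask[start:start + g, start:start + g] = 0.0
        start += g
    return mask

# Toy example: two prompt groups of 2 tokens each, embedding dim 4.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 4))
mask = group_mask([2, 2])
out = masked_attention(x, x, x, mask)
```

Because the mask is block-diagonal, the output for the first group is identical to running attention on that group alone, which is the property that lets new prompt groups be trained without disturbing the frozen zero-shot prompts.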

Cite

Text

Kim et al. "Retaining and Enhancing Pre-Trained Knowledge in Vision-Language Models with Prompt Ensembling." Winter Conference on Applications of Computer Vision, 2025.

Markdown

[Kim et al. "Retaining and Enhancing Pre-Trained Knowledge in Vision-Language Models with Prompt Ensembling." Winter Conference on Applications of Computer Vision, 2025.](https://mlanthology.org/wacv/2025/kim2025wacv-retaining/)

BibTeX

@inproceedings{kim2025wacv-retaining,
  title     = {{Retaining and Enhancing Pre-Trained Knowledge in Vision-Language Models with Prompt Ensembling}},
  author    = {Kim, Donggeun and Jo, Yujin and Lee, Myungjoo and Kim, Taesup},
  booktitle = {Winter Conference on Applications of Computer Vision},
  year      = {2025},
  pages     = {5550--5559},
  url       = {https://mlanthology.org/wacv/2025/kim2025wacv-retaining/}
}