Learning to Prompt with Text Only Supervision for Vision-Language Models

Muhammad Uzair Khattak, Muhammad Ferjad Naeem, Muzammal Naseer, Luc Van Gool, Federico Tombari

AAAI 2025 pp. 4230-4238

doi:10.1609/AAAI.V39I4.32444

Abstract

Foundational vision-language models like CLIP are emerging as a promising paradigm in vision due to their excellent generalization. However, adapting these models to downstream tasks while maintaining their generalization remains challenging. In the literature, one branch of methods adapts CLIP by learning prompts using images. While effective, these methods often rely on image-label data, which is not always practical, and they struggle to generalize to new datasets because they overfit to the few-shot source data. Another approach explores training-free methods that generate class captions from large language models (LLMs) and perform prompt ensembling, but these methods often produce static, class-specific prompts that cannot be transferred to new classes, and they incur additional costs by generating LLM descriptions for each class separately. In this work, we aim to combine the strengths of both approaches by learning prompts using only text data derived from LLMs. As supervised training of prompts in the image-free setup is non-trivial, we develop an efficient language-only training approach that enables prompts to distill rich contextual knowledge from LLM data. Furthermore, by mapping the LLM contextual text data within the learned prompts, our approach enables zero-shot transfer of prompts to new classes and datasets, potentially reducing the LLM prompt engineering cost. To the best of our knowledge, this is the first work that learns generalized and transferable prompts for image tasks using only text data. We perform evaluations on four benchmarks, where ProText improves over ensembling methods while being competitive with those using labeled images.
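The text-only training idea described in the abstract can be illustrated with a toy sketch. This is not the ProText implementation: the bag-of-words `encode` function is a stand-in for a frozen text encoder (it is not CLIP), the learnable "prompt" is reduced to a single offset vector instead of prompt tokens, the squared-error objective is an assumption, and the class descriptions are invented placeholders for LLM output. The sketch only shows the shape of the idea: fit shared parameters so that a class-name template is mapped toward the embedding of its description, then reuse those parameters on a class that was never trained on.

```python
import random

DIM = 16  # toy embedding size (assumption, chosen for illustration)

def word_vec(word):
    # Deterministic pseudo-random vector per word, seeded by the word itself.
    rng = random.Random(word)
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]

def encode(text):
    # Toy frozen text encoder: mean of word vectors. Stand-in only, NOT CLIP.
    vecs = [word_vec(w) for w in text.lower().split()]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def prompted_encode(text, prompt):
    # Frozen feature plus a learnable offset (toy stand-in for learnable prompts).
    return [f + p for f, p in zip(encode(text), prompt)]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def mapping_loss(pairs, prompt):
    # Average distance between prompted template features and description features.
    return sum(mse(prompted_encode(t, prompt), encode(d)) for t, d in pairs) / len(pairs)

def train_prompt(pairs, steps=200, lr=1.0):
    # Plain gradient descent on the mean squared error; no images are involved.
    prompt = [0.0] * DIM
    for _ in range(steps):
        grad = [0.0] * DIM
        for template, description in pairs:
            pred = prompted_encode(template, prompt)
            target = encode(description)
            for i in range(DIM):
                grad[i] += 2.0 * (pred[i] - target[i]) / (DIM * len(pairs))
        prompt = [p - lr * g for p, g in zip(prompt, grad)]
    return prompt

# Hypothetical (template, description) pairs; the descriptions are made up here
# and merely play the role of LLM-generated class descriptions.
train_pairs = [
    ("a photo of a cat", "a cat is a small furry animal with whiskers"),
    ("a photo of a dog", "a dog is a loyal furry animal with a wagging tail"),
    ("a photo of a horse", "a horse is a large animal with a mane and hooves"),
]

prompt = train_prompt(train_pairs)
loss_before = mapping_loss(train_pairs, [0.0] * DIM)
loss_after = mapping_loss(train_pairs, prompt)
print(f"mapping loss before: {loss_before:.4f}, after: {loss_after:.4f}")

# Transfer: the same learned offset is applied to a class unseen during training,
# without generating a new description for it.
unseen_feature = prompted_encode("a photo of a cow", prompt)
```

In this toy setting the learned offset captures only the average shift from template features to description features. The point it illustrates is that the learned parameters are shared across classes, which is what allows them to be applied to a new class name at inference time.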

Cite

Text

Khattak et al. "Learning to Prompt with Text Only Supervision for Vision-Language Models." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I4.32444

Markdown

[Khattak et al. "Learning to Prompt with Text Only Supervision for Vision-Language Models." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/khattak2025aaai-learning/) doi:10.1609/AAAI.V39I4.32444

BibTeX

@inproceedings{khattak2025aaai-learning,
  title     = {{Learning to Prompt with Text Only Supervision for Vision-Language Models}},
  author    = {Khattak, Muhammad Uzair and Naeem, Muhammad Ferjad and Naseer, Muzammal and Van Gool, Luc and Tombari, Federico},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {4230--4238},
  doi       = {10.1609/AAAI.V39I4.32444},
  url       = {https://mlanthology.org/aaai/2025/khattak2025aaai-learning/}
}