uCAP: An Unsupervised Prompting Method for Vision-Language Models

Abstract

This paper addresses a significant limitation that prevents Contrastive Language-Image Pre-training (CLIP) models from achieving optimal performance on downstream image classification tasks. The key problem with CLIP-style zero-shot classification is that it requires domain-specific context in the form of prompts to better align the class descriptions with the downstream data distribution. In particular, prompts for vision-language models are domain-level texts (e.g., “a centered satellite image of ...”) which, together with the class names, are fed into the text encoder to provide more context for the downstream dataset. These prompts are typically tuned by hand, which is time-consuming and often sub-optimal. To overcome this bottleneck, this paper proposes uCAP, a method to automatically learn domain-specific prompts (contexts) using only unlabeled in-domain images. We achieve this by modeling the generation of images given the class names and a domain-specific prompt with an unsupervised likelihood distribution, and then inferring the prompts. We validate the proposed method across various models and datasets, showing that uCAP consistently outperforms manually tuned prompts and related baselines on the evaluated datasets: ImageNet, CIFAR-10, CIFAR-100, OxfordPets (up to 2%), SUN397 (up to 5%), and Caltech101 (up to 3%).
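
To make the role of the prompt concrete, the following is a minimal sketch of CLIP-style zero-shot classification with a hand-written domain prompt, the setup that uCAP aims to replace with learned prompts. It does not implement uCAP itself. The function names, the class names, the temperature value, and `toy_text_encoder` (a hash-based stand-in for a real text encoder) are all assumptions made for illustration and do not come from the paper.

```python
import hashlib
import math


def normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def build_prompts(template, class_names):
    """Combine one domain-level template with every class name."""
    return [template.format(name) for name in class_names]


def toy_text_encoder(text, dim=8):
    """Hypothetical stand-in for a CLIP text encoder.

    Returns a deterministic pseudo-embedding derived from a hash of the text.
    A real system would call a pretrained text encoder here instead.
    """
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return normalize([b / 255.0 - 0.5 for b in digest[:dim]])


def zero_shot_probs(image_embedding, text_embeddings, temperature=0.01):
    """Softmax over temperature-scaled cosine similarities (assumed temperature)."""
    image_embedding = normalize(image_embedding)
    logits = [
        sum(a * b for a, b in zip(image_embedding, text)) / temperature
        for text in text_embeddings
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative class names; the template follows the example in the abstract.
class_names = ["forest", "river", "highway"]
prompts = build_prompts("a centered satellite image of {}.", class_names)
text_embeddings = [toy_text_encoder(p) for p in prompts]

# Pretend the image encoder produced an embedding aligned with the "river" prompt.
image_embedding = text_embeddings[1]
probs = zero_shot_probs(image_embedding, text_embeddings)
predicted = class_names[probs.index(max(probs))]
```

In this sketch the template string is the only domain-specific piece, and changing it changes every class embedding at once. That is why the choice of prompt affects accuracy across all classes, and why inferring it from unlabeled in-domain images, as the abstract describes, can stand in for manual tuning.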

Cite

Text

Nguyen et al. "uCAP: An Unsupervised Prompting Method for Vision-Language Models." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-72904-1_25

Markdown

[Nguyen et al. "uCAP: An Unsupervised Prompting Method for Vision-Language Models." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/nguyen2024eccv-ucap/) doi:10.1007/978-3-031-72904-1_25

BibTeX

@inproceedings{nguyen2024eccv-ucap,
  title     = {{uCAP: An Unsupervised Prompting Method for Vision-Language Models}},
  author    = {Nguyen, A. Tuan and Tai, Kai Sheng and Chen, Bor-Chun and Shukla, Satya Narayan and Yu, Hanchao and Torr, Philip and Tian, Tai-Peng and Lim, Ser-Nam},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-72904-1_25},
  url       = {https://mlanthology.org/eccv/2024/nguyen2024eccv-ucap/}
}