Active Learning for Vision Language Models

Abstract

Pre-trained vision-language models (VLMs) such as CLIP have demonstrated impressive zero-shot performance on a wide range of downstream computer vision tasks. However, a considerable performance gap remains between these models and a supervised deep model trained on a downstream dataset. To bridge this gap, we propose a novel active learning (AL) framework that enhances the zero-shot classification performance of VLMs by selecting only a few informative samples from the unlabeled data for annotation during training. To achieve this, our approach first calibrates the predicted entropy of VLMs and then utilizes a combination of self-uncertainty and neighbor-aware uncertainty to compute a reliable uncertainty measure for active sample selection. Extensive experiments show that the proposed approach outperforms existing AL approaches on several image classification datasets and significantly enhances the zero-shot performance of VLMs.
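
The sketch below illustrates, in a hedged way, the kind of uncertainty-driven selection the abstract describes: combining a sample's own predictive entropy with an entropy signal from its feature-space neighbors, then querying the highest-scoring samples for annotation. The k-nearest-neighbor formulation, the mixing weight `alpha`, and the omission of the paper's calibration step are illustrative assumptions, not the authors' exact algorithm.

```python
# Minimal sketch of active selection that mixes self-uncertainty with a
# neighbor-aware uncertainty term. Assumed formulation for illustration only;
# see the paper for the actual calibration and selection procedure.
import numpy as np

def entropy(probs, eps=1e-12):
    """Shannon entropy of each row of an (N, C) probability matrix."""
    return -np.sum(probs * np.log(probs + eps), axis=1)

def select_for_annotation(features, probs, budget, k=10, alpha=0.5):
    """Pick `budget` unlabeled samples with the highest combined uncertainty.

    features : (N, D) L2-normalized image embeddings (e.g., from a CLIP encoder)
    probs    : (N, C) zero-shot class probabilities for the same samples
    """
    self_unc = entropy(probs)  # self-uncertainty of each sample

    # Neighbor-aware uncertainty: mean entropy of each sample's k nearest
    # neighbors under cosine similarity (an assumed choice, not the paper's).
    sims = features @ features.T
    np.fill_diagonal(sims, -np.inf)        # a sample is not its own neighbor
    knn = np.argsort(-sims, axis=1)[:, :k]
    neighbor_unc = self_unc[knn].mean(axis=1)

    combined = alpha * self_unc + (1 - alpha) * neighbor_unc
    return np.argsort(-combined)[:budget]  # indices to send for labeling

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(500, 512))
    feats /= np.linalg.norm(feats, axis=1, keepdims=True)
    logits = rng.normal(size=(500, 10))
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    print(select_for_annotation(feats, probs, budget=16))
```

In practice the selected indices would be labeled by an annotator and used to adapt the VLM (e.g., via prompt tuning or fine-tuning) before the next selection round.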

Cite

Text

Safaei and Patel. "Active Learning for Vision Language Models." Winter Conference on Applications of Computer Vision, 2025.

Markdown

[Safaei and Patel. "Active Learning for Vision Language Models." Winter Conference on Applications of Computer Vision, 2025.](https://mlanthology.org/wacv/2025/safaei2025wacv-active/)

BibTeX

@inproceedings{safaei2025wacv-active,
  title     = {{Active Learning for Vision Language Models}},
  author    = {Safaei, Bardia and Patel, Vishal M.},
  booktitle = {Winter Conference on Applications of Computer Vision},
  year      = {2025},
  pages     = {4902--4912},
  url       = {https://mlanthology.org/wacv/2025/safaei2025wacv-active/}
}