Exploiting Category Names for Few-Shot Classification with Vision-Language Models

Abstract

Vision-language foundation models pretrained on large-scale data provide a powerful tool for many visual understanding tasks. Notably, many vision-language models build two encoders (visual and textual) that map the two modalities into the same embedding space. As a result, the learned representations achieve good zero-shot performance on tasks such as image classification. However, when only a few examples per category are available, the potential of large vision-language models often goes unrealized, mainly due to the gap between the large number of model parameters and the relatively small amount of training data. This paper shows that few-shot classification performance can be significantly improved by using the category names to initialize the classification head. With the proposed category name initialization method, our model obtains state-of-the-art performance on a number of few-shot image classification benchmarks (e.g., 87.37% on ImageNet and 96.08% on Stanford Cars, both using five-shot learning).
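The core idea described in the abstract — initializing a linear classification head with text-encoder embeddings of the category names, so the model starts from its zero-shot behavior before few-shot fine-tuning — can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the unit-norm random vectors stand in for real CLIP-style text embeddings of prompts such as "a photo of a {category name}".

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, dim = 10, 512

# Stand-ins for text-encoder embeddings of prompts like
# "a photo of a {category name}" (random here; a real setup would
# run each prompt through a vision-language model's text encoder).
text_emb = rng.standard_normal((num_classes, dim))
text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)

# Category-name initialization: the classification head's weight
# matrix starts as the text embeddings, so before any few-shot
# fine-tuning the head reproduces zero-shot classification.
W = text_emb.copy()

def classify(image_emb: np.ndarray) -> int:
    """Cosine-similarity logits against the head's weight rows."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    return int(np.argmax(W @ image_emb))

# An image embedding aligned with category 3's text embedding
# is assigned to category 3.
print(classify(text_emb[3]))  # → 3
```

In a few-shot setting, `W` would then be fine-tuned on the handful of labeled examples per class, rather than being trained from a random initialization.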

Cite

Text

Xiao et al. "Exploiting Category Names for Few-Shot Classification with Vision-Language Models." ICLR 2023 Workshops: MRL, 2023.

Markdown

[Xiao et al. "Exploiting Category Names for Few-Shot Classification with Vision-Language Models." ICLR 2023 Workshops: MRL, 2023.](https://mlanthology.org/iclrw/2023/xiao2023iclrw-exploiting/)

BibTeX

@inproceedings{xiao2023iclrw-exploiting,
  title     = {{Exploiting Category Names for Few-Shot Classification with Vision-Language Models}},
  author    = {Xiao, Taihong and Wang, Zirui and Cao, Liangliang and Yu, Jiahui and Dai, Shengyang and Yang, Ming-Hsuan},
  booktitle = {ICLR 2023 Workshops: MRL},
  year      = {2023},
  url       = {https://mlanthology.org/iclrw/2023/xiao2023iclrw-exploiting/}
}