Transductive Zero-Shot and Few-Shot CLIP

Abstract

Transductive inference has been widely investigated in few-shot image classification, but completely overlooked in the recent, fast-growing literature on adapting vision-language models like CLIP. This paper addresses the transductive zero-shot and few-shot CLIP classification challenge, in which inference is performed jointly across a mini-batch of unlabeled query samples rather than treating each instance independently. We initially construct informative vision-text probability features, leading to a classification problem on the unit simplex set. Inspired by Expectation-Maximization (EM), our optimization-based classification objective models the data probability distribution for each class using a Dirichlet law. The minimization problem is then tackled with a novel block Majorization-Minimization algorithm, which simultaneously estimates the distribution parameters and class assignments. Extensive numerical experiments on 11 datasets underscore the benefits and efficacy of our batch inference approach. On zero-shot tasks with test batches of 75 samples, our approach yields a near-20% improvement in ImageNet accuracy over CLIP's zero-shot performance. Additionally, we outperform state-of-the-art methods in the few-shot setting. Code is available at https://github.com/SegoleneMartin/transductive-CLIP.
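
The sketch below illustrates, in broad strokes, the pipeline described in the abstract: CLIP-style vision-text probability features are built by applying a softmax to scaled image-text cosine similarities, and an EM-style loop then fits one Dirichlet distribution per class while updating soft class assignments over the whole query batch. This is a minimal, hypothetical illustration only: the authors' block Majorization-Minimization updates are replaced by a simple moment-matching Dirichlet estimate, and random embeddings stand in for actual CLIP encoder outputs.

```python
# Hypothetical sketch of the transductive pipeline described above, NOT the
# authors' implementation. The paper's block Majorization-Minimization updates
# are replaced here by a simple moment-matching Dirichlet estimate.
import numpy as np
from scipy.special import gammaln, softmax


def probability_features(image_emb, text_emb, temperature=100.0):
    """Softmax of scaled image-text cosine similarities -> points on the unit simplex."""
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return softmax(temperature * image_emb @ text_emb.T, axis=1)


def dirichlet_logpdf(X, alpha):
    """Dirichlet(alpha) log-density at each row of X (rows lie on the simplex)."""
    return (gammaln(alpha.sum()) - gammaln(alpha).sum()
            + ((alpha - 1.0) * np.log(X)).sum(axis=1))


def transductive_em(X, n_iter=50, eps=1e-6):
    """Simplified EM over a query batch: one Dirichlet per class, soft assignments.

    Assumes the k-th simplex coordinate is the zero-shot probability of class k,
    so the feature dimension equals the number of classes.
    """
    X = np.clip(X, eps, 1.0)
    X = X / X.sum(axis=1, keepdims=True)
    n, K = X.shape                               # K = number of classes
    R = X.copy()                                 # init responsibilities with zero-shot probs
    log_pi = np.full(K, -np.log(K))              # uniform class priors
    alphas = np.ones((K, K))                     # one Dirichlet parameter vector per class
    for _ in range(n_iter):
        # M-step (stand-in): moment-matching estimate of each class's Dirichlet parameters
        for c in range(K):
            w = R[:, c] / max(R[:, c].sum(), eps)
            m = w @ X                            # weighted mean on the simplex
            v = w @ (X - m) ** 2                 # weighted per-coordinate variance
            s = np.clip(np.mean(m * (1 - m) / (v + eps)) - 1.0, 1.0, 1e4)  # precision
            alphas[c] = np.clip(s * m, eps, None)
        log_pi = np.log(R.mean(axis=0) + eps)
        # E-step: responsibilities from current class-conditional Dirichlet densities
        logp = np.stack([dirichlet_logpdf(X, a) for a in alphas], axis=1) + log_pi
        R = softmax(logp, axis=1)
    return R.argmax(axis=1), alphas


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image_emb = rng.normal(size=(75, 512))   # stand-in for CLIP image embeddings
    text_emb = rng.normal(size=(5, 512))     # stand-in for CLIP text (prompt) embeddings
    X = probability_features(image_emb, text_emb)
    labels, _ = transductive_em(X)
    print(labels)
```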

Cite

Text

Martin et al. "Transductive Zero-Shot and Few-Shot CLIP." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.02722

Markdown

[Martin et al. "Transductive Zero-Shot and Few-Shot CLIP." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/martin2024cvpr-transductive/) doi:10.1109/CVPR52733.2024.02722

BibTeX

@inproceedings{martin2024cvpr-transductive,
  title     = {{Transductive Zero-Shot and Few-Shot CLIP}},
  author    = {Martin, Ségolène and Huang, Yunshi and Shakeri, Fereshteh and Pesquet, Jean-Christophe and Ben Ayed, Ismail},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {28816--28826},
  doi       = {10.1109/CVPR52733.2024.02722},
  url       = {https://mlanthology.org/cvpr/2024/martin2024cvpr-transductive/}
}