Boosting Vision-Language Models with Transduction
Abstract
Transduction is a powerful paradigm that leverages the structure of unlabeled data to boost predictive accuracy. We present TransCLIP, a novel and computationally efficient transductive approach designed for Vision-Language Models (VLMs). TransCLIP is applicable as a plug-and-play module on top of popular inductive zero- and few-shot models, consistently improving their performance. Our new objective function can be viewed as a regularized maximum-likelihood estimation, constrained by a KL divergence penalty that integrates the text-encoder knowledge and guides the transductive learning process. We further derive an iterative Block Majorize-Minimize (BMM) procedure for optimizing our objective, with guaranteed convergence and decoupled sample-assignment updates, yielding computationally efficient transduction for large-scale datasets. We report comprehensive evaluations, comparisons, and ablation studies that demonstrate: (i) transduction can greatly enhance the generalization capabilities of inductive pretrained zero- and few-shot VLMs; (ii) TransCLIP substantially outperforms standard transductive few-shot learning methods relying solely on vision features, notably due to the KL-based language constraint.
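To make the abstract's key ingredients concrete, below is a minimal, self-contained sketch of a KL-anchored transductive update loop. It is an illustration under simplifying assumptions, not the paper's exact algorithm: it stands in cosine-similarity logits for TransCLIP's likelihood model, omits any additional regularization terms, and all names (`transductive_assignments`, `temp`, etc.) are hypothetical. It does show the two properties the abstract highlights: the text-encoder predictions y_hat enter as a KL anchor, and the per-sample assignment update is decoupled and closed-form, since minimizing -z_i . log p_i + KL(z_i || y_hat_i) over the simplex gives z_i proportional to y_hat_i * p_i.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax."""
    logits = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(logits)
    return exp / exp.sum(axis=axis, keepdims=True)

def transductive_assignments(image_feats, text_feats, n_iters=10, temp=0.01):
    """Toy KL-anchored transductive labeling (a hypothetical sketch, not TransCLIP itself).

    image_feats: (N, D) L2-normalized image embeddings of the unlabeled set.
    text_feats:  (K, D) L2-normalized class text embeddings.
    Returns (N, K) soft class assignments.
    """
    # Zero-shot, text-driven predictions serve as the KL anchor y_hat.
    y_hat = softmax(image_feats @ text_feats.T / temp)
    z = y_hat.copy()  # initialize assignments from the zero-shot predictions

    for _ in range(n_iters):
        # Block 1: update class centroids from the current soft assignments
        # (the analogue of a maximum-likelihood mean update).
        mu = z.T @ image_feats                      # (K, D)
        mu /= np.linalg.norm(mu, axis=1, keepdims=True)

        # Block 2: decoupled per-sample assignment update. Minimizing
        #   -z_i . log p_i + KL(z_i || y_hat_i)  over the probability simplex
        # has the closed form z_i ∝ y_hat_i * p_i, i.e. a softmax of the
        # visual logits plus the log of the text-based anchor.
        visual_logits = image_feats @ mu.T / temp   # (N, K)
        z = softmax(visual_logits + np.log(y_hat + 1e-12))
    return z

# Usage with random stand-in features:
rng = np.random.default_rng(0)
x = rng.normal(size=(256, 64)); x /= np.linalg.norm(x, axis=1, keepdims=True)
t = rng.normal(size=(10, 64));  t /= np.linalg.norm(t, axis=1, keepdims=True)
z = transductive_assignments(x, t)
print(z.shape, z.sum(axis=1)[:3])  # (256, 10), rows sum to 1
```

Because each z-update depends only on its own sample's logits, the loop scales linearly in the number of unlabeled points, which is the intuition behind the abstract's claim of efficient large-scale transduction.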
Cite

Text

Zanella et al. "Boosting Vision-Language Models with Transduction." Neural Information Processing Systems, 2024. doi:10.52202/079017-1988

Markdown

[Zanella et al. "Boosting Vision-Language Models with Transduction." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/zanella2024neurips-boosting/) doi:10.52202/079017-1988

BibTeX
@inproceedings{zanella2024neurips-boosting,
title = {{Boosting Vision-Language Models with Transduction}},
author = {Zanella, Maxime and Gérin, Benoît and Ben Ayed, Ismail},
booktitle = {Neural Information Processing Systems},
year = {2024},
doi = {10.52202/079017-1988},
url = {https://mlanthology.org/neurips/2024/zanella2024neurips-boosting/}
}