Beyond Sole Strength: Customized Ensembles for Generalized Vision-Language Models
Abstract
Fine-tuning pre-trained vision-language models (VLMs), e.g., CLIP, for open-world generalization has gained increasing popularity due to its practical value. However, performance advancements are limited when relying solely on intricate algorithmic designs for a single model, even one exhibiting strong performance, e.g., CLIP-ViT-B/16. This paper, for the first time, explores the collaborative potential of leveraging much weaker VLMs to enhance the generalization of a robust single model. The affirmative findings motivate us to address the generalization problem from a novel perspective, i.e., an ensemble of pre-trained VLMs. We introduce three customized ensemble strategies, each tailored to a specific scenario. First, we introduce the zero-shot ensemble, which automatically adjusts the logits of different models based on their confidence when only pre-trained VLMs are available. Furthermore, for scenarios with extra few-shot samples, we propose the training-free ensemble and the tuning ensemble, offering flexibility based on the availability of computing resources. The code is available at https://github.com/zhiheLu/Ensemble_VLM.git.
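The zero-shot ensemble described above weights each model's logits by its prediction confidence. A minimal sketch of one plausible instantiation, assuming confidence is measured as each model's maximum softmax probability per sample (the paper's exact weighting scheme may differ):

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def confidence_weighted_ensemble(logits_list):
    """Combine per-model classification logits into one prediction.

    Illustrative sketch: each model's weight for a sample is its
    max softmax probability (confidence), normalized across models.
    logits_list: list of M arrays, each of shape (N, C).
    Returns an (N, C) array of ensembled logits.
    """
    conf = np.array([softmax(l).max(axis=-1) for l in logits_list])  # (M, N)
    weights = conf / conf.sum(axis=0, keepdims=True)                 # (M, N)
    stacked = np.stack(logits_list)                                  # (M, N, C)
    return (weights[..., None] * stacked).sum(axis=0)                # (N, C)
```

Here a confident strong model (e.g., CLIP-ViT-B/16) dominates the ensemble on samples where weaker VLMs are uncertain, while weaker models can still contribute where they are more confident.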
Cite
Text
Lu et al. "Beyond Sole Strength: Customized Ensembles for Generalized Vision-Language Models." International Conference on Machine Learning, 2024.

Markdown

[Lu et al. "Beyond Sole Strength: Customized Ensembles for Generalized Vision-Language Models." International Conference on Machine Learning, 2024.](https://mlanthology.org/icml/2024/lu2024icml-beyond/)

BibTeX
@inproceedings{lu2024icml-beyond,
title = {{Beyond Sole Strength: Customized Ensembles for Generalized Vision-Language Models}},
author = {Lu, Zhihe and Bai, Jiawang and Li, Xin and Xiao, Zeyu and Wang, Xinchao},
booktitle = {International Conference on Machine Learning},
year = {2024},
pages = {32924--32938},
volume = {235},
url = {https://mlanthology.org/icml/2024/lu2024icml-beyond/}
}