Connecting the Dots: Collaborative Fine-Tuning for Black-Box Vision-Language Models

Abstract

With the emergence of pretrained vision-language models (VLMs), considerable effort has been devoted to fine-tuning them for downstream tasks. Despite progress in designing efficient fine-tuning methods, such methods require access to the model’s parameters, which can be challenging because model owners often provide their models only as a black box to safeguard model ownership. This paper proposes a Collaborative Fine-Tuning (CraFT) approach for fine-tuning black-box VLMs for downstream tasks, where one has access only to the input prompts and the output predictions of the model. CraFT comprises two modules: a prompt generation module for learning text prompts and a prediction refinement module for enhancing output predictions in a residual style. Additionally, we introduce an auxiliary prediction-consistent loss to promote consistent optimization across these modules. The modules are optimized by a novel collaborative training algorithm. Extensive experiments on few-shot classification over 15 datasets demonstrate the superiority of CraFT. The results show that CraFT achieves a gain of about 12% with 16-shot datasets and only 8,000 queries. Moreover, CraFT trains faster and uses only about 1/80 of the memory footprint for deployment, while sacrificing only 1.62% compared to the white-box method. Our code is publicly available at https://github.com/mrflogs/CraFT.
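
To make the phrase "enhancing output predictions in residual style" concrete, the following is a minimal sketch of one way a residual refinement step over black-box outputs could look. It is not taken from the authors' code: the linear form of the correction and the names `refine`, `weights`, `bias`, and `alpha` are assumptions chosen purely for illustration. The only property it tries to capture is that the black-box prediction is kept and a learned correction is added on top of it.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def refine(black_box_logits, weights, bias, alpha=1.0):
    """Residual refinement (hypothetical form): output = logits + alpha * correction.

    The correction here is a plain linear map of the black-box logits;
    the actual module in the paper may be structured differently.
    """
    correction = [
        sum(w * x for w, x in zip(row, black_box_logits)) + b
        for row, b in zip(weights, bias)
    ]
    return [x + alpha * c for x, c in zip(black_box_logits, correction)]

num_classes = 3
logits = [2.0, 0.5, -1.0]  # stand-in for scores returned by the black-box VLM

# With an all-zero correction, the refined output equals the black-box output.
zero_weights = [[0.0] * num_classes for _ in range(num_classes)]
zero_bias = [0.0] * num_classes
refined = refine(logits, zero_weights, zero_bias)
probs = softmax(refined)

# A non-zero correction can change the predicted class.
shifted = refine(logits, zero_weights, [0.0, 3.0, 0.0])
predicted = max(range(num_classes), key=lambda i: shifted[i])
```

Under this assumed form, initializing the correction at zero means the refined prediction starts out identical to the black-box prediction, so training can only move away from that baseline as the correction is learned.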

Cite

Text

Wang et al. "Connecting the Dots: Collaborative Fine-Tuning for Black-Box Vision-Language Models." International Conference on Machine Learning, 2024.

BibTeX

@inproceedings{wang2024icml-connecting,
  title     = {{Connecting the Dots: Collaborative Fine-Tuning for Black-Box Vision-Language Models}},
  author    = {Wang, Zhengbo and Liang, Jian and He, Ran and Wang, Zilei and Tan, Tieniu},
  booktitle = {International Conference on Machine Learning},
  year      = {2024},
  pages     = {50931--50943},
  volume    = {235},
  url       = {https://mlanthology.org/icml/2024/wang2024icml-connecting/}
}