Visual Prompt Tuning
Abstract
The current modus operandi in adapting pre-trained models involves updating all the backbone parameters, i.e. full fine-tuning. This paper introduces Visual Prompt Tuning (VPT) as an efficient and effective alternative to full fine-tuning for large-scale Transformer models in vision. Taking inspiration from recent advances in efficiently tuning large language models, VPT introduces only a small amount (less than 1% of model parameters) of trainable parameters in the input space while keeping the model backbone frozen. Via extensive experiments on a wide variety of downstream recognition tasks, we show that VPT achieves significant performance gains compared to other parameter-efficient tuning protocols. Most importantly, VPT even outperforms full fine-tuning in many cases across model capacities and training scales, while reducing per-task storage cost. Code is available at https://github.com/kmnp/vpt.
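To make the idea concrete, below is a minimal PyTorch sketch of the shallow variant of the technique the abstract describes: learnable prompt tokens are prepended to the input sequence of a frozen Transformer, and only the prompts and a per-task head are trained. The `patch_embed` and `encoder` arguments are hypothetical stand-ins for a ViT-style backbone's modules, not names from the released code; positional embeddings are omitted for brevity, and the [CLS] token is made learnable here only to keep the sketch self-contained.

```python
import torch
import torch.nn as nn

class VPTShallow(nn.Module):
    """Sketch: prepend learnable prompt tokens to a frozen Transformer.

    `patch_embed` (image -> token sequence) and `encoder` (Transformer
    blocks) are assumed ViT-style modules; names are illustrative.
    """

    def __init__(self, patch_embed, encoder, embed_dim=768,
                 num_prompts=10, num_classes=100):
        super().__init__()
        self.patch_embed = patch_embed
        self.encoder = encoder
        # Freeze the entire backbone; only prompts and the head train,
        # which is where the "less than 1% of parameters" claim comes from.
        for module in (self.patch_embed, self.encoder):
            for p in module.parameters():
                p.requires_grad = False
        # Learnable [CLS] token (frozen and pretrained in the actual method;
        # learned here only so the sketch runs standalone).
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # Learnable prompt tokens inserted into the input space.
        self.prompts = nn.Parameter(torch.empty(1, num_prompts, embed_dim))
        nn.init.uniform_(self.prompts, -0.5, 0.5)
        # Lightweight per-task classification head.
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        tokens = self.patch_embed(x)                    # (B, N, D) patch tokens
        B = tokens.shape[0]
        cls = self.cls_token.expand(B, -1, -1)
        prompts = self.prompts.expand(B, -1, -1)        # broadcast to batch
        seq = torch.cat([cls, prompts, tokens], dim=1)  # [CLS | prompts | patches]
        out = self.encoder(seq)                         # frozen Transformer
        return self.head(out[:, 0])                     # classify from [CLS]
```

Because the backbone is shared and frozen, deploying a new task only requires storing the small prompt and head tensors, which is the per-task storage saving the abstract refers to.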
Cite
Text
Jia et al. "Visual Prompt Tuning." Proceedings of the European Conference on Computer Vision (ECCV), 2022. doi:10.1007/978-3-031-19827-4_41

Markdown
[Jia et al. "Visual Prompt Tuning." Proceedings of the European Conference on Computer Vision (ECCV), 2022.](https://mlanthology.org/eccv/2022/jia2022eccv-visual/) doi:10.1007/978-3-031-19827-4_41

BibTeX
@inproceedings{jia2022eccv-visual,
title = {{Visual Prompt Tuning}},
author = {Jia, Menglin and Tang, Luming and Chen, Bor-Chun and Cardie, Claire and Belongie, Serge and Hariharan, Bharath and Lim, Ser-Nam},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2022},
doi = {10.1007/978-3-031-19827-4_41},
url = {https://mlanthology.org/eccv/2022/jia2022eccv-visual/}
}