LoGoPrompt: Synthetic Text Images Can Be Good Visual Prompts for Vision-Language Models
Abstract
Prompt engineering is a powerful tool for enhancing the performance of pre-trained models on downstream tasks. For example, providing the prompt "Let's think step by step" improved GPT-3's reasoning accuracy to 63% on MultiArith, while prompting "a photo of" filled with a class name enables CLIP to achieve 80% zero-shot accuracy on ImageNet. While previous research has explored prompt learning for the visual modality, analysis of what constitutes a good visual prompt for image recognition remains limited. In addition, existing visual prompt tuning methods generalize worse than text-only prompt tuning. This paper explores our key insight: synthetic text images are good visual prompts for vision-language models! To this end, we propose LoGoPrompt, which reformulates the classification objective as visual prompt selection and addresses the chicken-and-egg challenge of whether to first add synthetic text images as class-wise visual prompts or to first predict the class. Without any trainable visual prompt parameters, experimental results on 16 datasets demonstrate that our method consistently outperforms state-of-the-art methods in few-shot learning, base-to-new generalization, and domain generalization. The code will be publicly available upon publication.
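As a rough illustration of the class-name prompting mentioned in the abstract, the sketch below fills a text template with each class name and picks the class whose text feature is most similar to an image feature. This is not the LoGoPrompt method and not CLIP's actual API: the template string, the helper names `build_prompts` and `zero_shot_predict`, and the three-dimensional toy vectors standing in for encoder outputs are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def build_prompts(class_names, template="a photo of a {}."):
    """Fill a hand-written template with each class name (template is an assumption)."""
    return [template.format(name) for name in class_names]


def zero_shot_predict(image_feature, text_features):
    """Return the index of the most similar text feature, plus all scores."""
    scores = [cosine(image_feature, t) for t in text_features]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


class_names = ["cat", "dog", "car"]
prompts = build_prompts(class_names)

# Toy vectors standing in for text-encoder and image-encoder outputs.
# A real pipeline would obtain these from a vision-language model.
toy_text_features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
toy_image_feature = [0.1, 0.9, 0.2]

best, scores = zero_shot_predict(toy_image_feature, toy_text_features)
print(prompts[best])  # prints "a photo of a dog."
```

With real encoders, only the two feature sources would change; the template filling and the argmax over similarities would stay the same.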
Cite
Text
Shi and Yang. "LoGoPrompt: Synthetic Text Images Can Be Good Visual Prompts for Vision-Language Models." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.00274
Markdown
[Shi and Yang. "LoGoPrompt: Synthetic Text Images Can Be Good Visual Prompts for Vision-Language Models." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/shi2023iccv-logoprompt/) doi:10.1109/ICCV51070.2023.00274
BibTeX
@inproceedings{shi2023iccv-logoprompt,
title = {{LoGoPrompt: Synthetic Text Images Can Be Good Visual Prompts for Vision-Language Models}},
author = {Shi, Cheng and Yang, Sibei},
booktitle = {International Conference on Computer Vision},
year = {2023},
pages = {2932-2941},
doi = {10.1109/ICCV51070.2023.00274},
url = {https://mlanthology.org/iccv/2023/shi2023iccv-logoprompt/}
}