Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition
Abstract
This work proposes POMP, a prompt pre-training method for vision-language models. Being memory- and computation-efficient, POMP enables the learned prompt to condense semantic information for a rich set of visual concepts spanning more than twenty thousand classes. Once pre-trained, the highly transferable prompt can be directly plugged into a variety of visual recognition tasks, including image classification, semantic segmentation, and object detection, to boost recognition performance in a zero-shot manner. Empirical evaluation shows that POMP achieves state-of-the-art performance on 21 datasets, e.g., 67.0% average accuracy on 10 classification datasets (+3.1% over CoOp) and 84.4 hIoU on open-vocabulary Pascal VOC segmentation (+6.9 over ZSSeg).
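To make the abstract concrete, below is a minimal PyTorch sketch of the prompt-learning setup POMP builds on: a single learnable context prompt is shared across all class names, a frozen text encoder turns each prompted name into a classifier weight, and, to keep pre-training over a twenty-thousand-class vocabulary tractable, each update contrasts an image against only a sampled subset of classes. All names, dimensions, the dummy encoder, and the uniform sampling scheme are illustrative assumptions for this sketch, not the authors' code or their exact sampling/correction procedure.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative dimensions only. POMP builds on CLIP, whose text encoder
# maps a token sequence to a single embedding; a frozen dummy encoder
# stands in here so the sketch runs end to end.
TOKEN_DIM, EMBED_DIM, PROMPT_LEN, NAME_LEN = 512, 512, 16, 4
NUM_CLASSES, SAMPLED_CLASSES = 20000, 1000  # sample a subset per step

class DummyTextEncoder(nn.Module):
    """Stand-in for a frozen CLIP-style text encoder (mean-pool + project)."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(TOKEN_DIM, EMBED_DIM)
    def forward(self, tokens):                  # (N, L, TOKEN_DIM)
        return self.proj(tokens.mean(dim=1))    # (N, EMBED_DIM)

class SharedPrompt(nn.Module):
    """One learnable context prompt shared by every class (CoOp-style)."""
    def __init__(self, text_encoder):
        super().__init__()
        self.context = nn.Parameter(0.02 * torch.randn(PROMPT_LEN, TOKEN_DIM))
        self.text_encoder = text_encoder
        for p in self.text_encoder.parameters():  # encoder stays frozen
            p.requires_grad_(False)
    def forward(self, name_tokens):              # (N, NAME_LEN, TOKEN_DIM)
        ctx = self.context.expand(name_tokens.size(0), -1, -1)
        return self.text_encoder(torch.cat([ctx, name_tokens], dim=1))

def contrastive_step(prompt, image_feats, labels, all_name_tokens):
    """One pre-training step: contrast each image against its own class
    plus a random subset of the vocabulary, so the full 20k-class
    softmax never has to be materialized."""
    neg_idx = torch.randperm(NUM_CLASSES)[:SAMPLED_CLASSES]
    idx = torch.unique(torch.cat([labels, neg_idx]))
    class_feats = prompt(all_name_tokens[idx])                 # (K, D)
    logits = (F.normalize(image_feats, dim=-1)
              @ F.normalize(class_feats, dim=-1).t()) / 0.01
    # remap each gold label to its position inside the sampled subset
    remap = (idx.unsqueeze(0) == labels.unsqueeze(1)).float().argmax(dim=1)
    return F.cross_entropy(logits, remap)

# Toy usage with random "image" features and class-name token embeddings.
prompt = SharedPrompt(DummyTextEncoder())
images = torch.randn(8, EMBED_DIM)
labels = torch.randint(0, NUM_CLASSES, (8,))
names = torch.randn(NUM_CLASSES, NAME_LEN, TOKEN_DIM)
loss = contrastive_step(prompt, images, labels, names)
loss.backward()  # gradients flow only into the shared context vectors

Because only the shared context vectors are trained while the encoder stays frozen, the per-step cost is dominated by encoding the sampled class names, which is what makes a vocabulary of this scale feasible in memory and compute.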
Cite
Text
Ren et al. "Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition." Neural Information Processing Systems, 2023.
Markdown
[Ren et al. "Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition." Neural Information Processing Systems, 2023.](https://mlanthology.org/neurips/2023/ren2023neurips-prompt/)
BibTeX
@inproceedings{ren2023neurips-prompt,
title = {{Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition}},
author = {Ren, Shuhuai and Zhang, Aston and Zhu, Yi and Zhang, Shuai and Zheng, Shuai and Li, Mu and Smola, Alexander J and Sun, Xu},
booktitle = {Neural Information Processing Systems},
year = {2023},
url = {https://mlanthology.org/neurips/2023/ren2023neurips-prompt/}
}