Open-Set Fine-Grained Retrieval via Prompting Vision-Language Evaluator

Abstract

Open-set fine-grained retrieval is an emerging challenge that requires the additional capability of retrieving unknown subcategories during evaluation. However, existing methods are rooted in closed-set scenarios, where all subcategories are pre-defined, which makes it hard for them to capture discriminative knowledge about unknown subcategories; consequently, they fail to handle the unknown subcategories that inevitably appear in open-world scenarios. In this work, we propose a novel Prompting vision-Language Evaluator (PLEor) framework for open-set fine-grained retrieval, based on the recently introduced contrastive language-image pretraining (CLIP) model. PLEor leverages the pre-trained CLIP model to infer the discrepancies that span both pre-defined and unknown subcategories, called category-specific discrepancies, and transfers them to a backbone network trained in closed-set scenarios. To make the pre-trained CLIP model sensitive to category-specific discrepancies, we design a dual prompt scheme that learns a vision prompt specifying the category-specific discrepancies and turns random vectors, combined with category names in a text prompt, into category-specific discrepancy descriptions. Moreover, a vision-language evaluator is proposed to semantically align the vision and text prompts based on the CLIP model, so that the two prompts reinforce each other. In addition, we propose open-set knowledge transfer, which transfers the category-specific discrepancies into the backbone network via knowledge distillation. A variety of quantitative and qualitative experiments show that PLEor achieves promising performance on open-set fine-grained retrieval datasets.
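To make the knowledge-distillation step mentioned above more concrete, the following is a minimal sketch of a generic temperature-softened distillation loss, in which a teacher (here standing in for the CLIP-based evaluator) supplies a target distribution for a student (standing in for the backbone network). This is an illustrative assumption, not the loss used in PLEor: the function names `log_softmax` and `distillation_loss`, the `temperature` parameter, and the example scores are all hypothetical.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of scores."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_sum_exp for x in scaled]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    Generic knowledge-distillation objective (hypothetical stand-in,
    not the paper's exact formulation).
    """
    log_p = log_softmax(teacher_logits, temperature)  # teacher targets
    log_q = log_softmax(student_logits, temperature)  # student predictions
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Hypothetical similarity scores of one image against three categories.
teacher_scores = [2.0, 0.5, -1.0]
student_scores = [1.0, 1.0, 0.0]
loss = distillation_loss(teacher_scores, student_scores)
```

The loss is zero when the student reproduces the teacher's distribution and positive otherwise, so minimizing it pulls the student's predictions toward the teacher's.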

Cite

Text

Wang et al. "Open-Set Fine-Grained Retrieval via Prompting Vision-Language Evaluator." Conference on Computer Vision and Pattern Recognition, 2023. doi:10.1109/CVPR52729.2023.01857

Markdown

[Wang et al. "Open-Set Fine-Grained Retrieval via Prompting Vision-Language Evaluator." Conference on Computer Vision and Pattern Recognition, 2023.](https://mlanthology.org/cvpr/2023/wang2023cvpr-openset/) doi:10.1109/CVPR52729.2023.01857

BibTeX

@inproceedings{wang2023cvpr-openset,
  title     = {{Open-Set Fine-Grained Retrieval via Prompting Vision-Language Evaluator}},
  author    = {Wang, Shijie and Chang, Jianlong and Li, Haojie and Wang, Zhihui and Ouyang, Wanli and Tian, Qi},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2023},
  pages     = {19381--19391},
  doi       = {10.1109/CVPR52729.2023.01857},
  url       = {https://mlanthology.org/cvpr/2023/wang2023cvpr-openset/}
}