PromptDet: Towards Open-Vocabulary Detection Using Uncurated Images

Abstract

The goal of this work is to establish a scalable pipeline for expanding an object detector towards novel/unseen categories, using zero manual annotations. To achieve that, we make the following four contributions: (i) in pursuit of generalisation, we propose a two-stage open-vocabulary object detector, where the class-agnostic object proposals are classified with a text encoder from a pre-trained visual-language model; (ii) to pair the visual latent space (of RPN box proposals) with that of the pre-trained text encoder, we propose the idea of regional prompt learning to align the textual embedding space with regional visual object features; (iii) to scale up the learning procedure towards detecting a wider spectrum of objects, we exploit available online resources via a novel self-training framework, which allows the proposed detector to be trained on a large corpus of noisy, uncurated web images; (iv) lastly, to evaluate our proposed detector, termed PromptDet, we conduct extensive experiments on the challenging LVIS and MS-COCO datasets. PromptDet shows superior performance over existing approaches with fewer additional training images and zero manual annotations. Project page with code: https://fcjian.github.io/promptdet.
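
To make contribution (i) concrete, the following is a minimal sketch of how a class-agnostic proposal could be classified against text embeddings of category names. It is not the authors' implementation: the function names (`l2_normalize`, `classify_regions`), the cosine-similarity-plus-softmax scoring, the `temperature` value, and the toy 3-dimensional embeddings are all assumptions made for illustration. In a real system the embeddings would come from a pre-trained text encoder and the region features from the detector's RoI head.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def classify_regions(region_feats, class_embeds, temperature=0.01):
    """Score each region feature against every class text embedding.

    region_feats: list of feature vectors, one per box proposal (hypothetical).
    class_embeds: dict mapping class name -> text embedding (hypothetical).
    temperature:  assumed softmax temperature; not taken from the paper.
    Returns a list of (best_class_name, probabilities) tuples.
    """
    names = list(class_embeds)
    text = [l2_normalize(class_embeds[n]) for n in names]
    results = []
    for feat in region_feats:
        f = l2_normalize(feat)
        # Cosine similarity between the region and each class embedding.
        logits = [sum(a * b for a, b in zip(f, t)) / temperature for t in text]
        # Numerically stable softmax over the classes.
        m = max(logits)
        exps = [math.exp(z - m) for z in logits]
        total = sum(exps)
        probs = [e / total for e in exps]
        best = max(range(len(names)), key=probs.__getitem__)
        results.append((names[best], probs))
    return results


# Toy stand-ins for text-encoder outputs and RoI features.
class_embeds = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "zebra": [0.0, 0.0, 1.0],
}
region_feats = [[0.9, 0.1, 0.0], [0.1, 0.2, 2.0]]
predictions = classify_regions(region_feats, class_embeds)
```

Because the classifier weights are just text embeddings, adding a novel category in this sketch only requires adding another entry to `class_embeds`; no box annotations for that category are needed at classification time.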

Cite

Text

Feng et al. "PromptDet: Towards Open-Vocabulary Detection Using Uncurated Images." Proceedings of the European Conference on Computer Vision (ECCV), 2022. doi:10.1007/978-3-031-20077-9_41

Markdown

[Feng et al. "PromptDet: Towards Open-Vocabulary Detection Using Uncurated Images." Proceedings of the European Conference on Computer Vision (ECCV), 2022.](https://mlanthology.org/eccv/2022/feng2022eccv-promptdet/) doi:10.1007/978-3-031-20077-9_41

BibTeX

@inproceedings{feng2022eccv-promptdet,
  title     = {{PromptDet: Towards Open-Vocabulary Detection Using Uncurated Images}},
  author    = {Feng, Chengjian and Zhong, Yujie and Jie, Zequn and Chu, Xiangxiang and Ren, Haibing and Wei, Xiaolin and Xie, Weidi and Ma, Lin},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2022},
  doi       = {10.1007/978-3-031-20077-9_41},
  url       = {https://mlanthology.org/eccv/2022/feng2022eccv-promptdet/}
}