DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-Training via Word-Region Alignment

Abstract

This paper presents DetCLIPv2, an efficient and scalable training framework that incorporates large-scale image-text pairs to achieve open-vocabulary object detection (OVD). Unlike previous OVD frameworks that typically rely on a pre-trained vision-language model (e.g., CLIP) or exploit image-text pairs via a pseudo-labeling process, DetCLIPv2 directly learns fine-grained word-region alignment from massive image-text pairs in an end-to-end manner. To accomplish this, we employ a maximum word-region similarity between region proposals and textual words to guide the contrastive objective. To enable the model to gain localization capability while learning broad concepts, DetCLIPv2 is trained with hybrid supervision from detection, grounding, and image-text pair data under a unified data formulation. By jointly training with an alternating scheme and adopting low-resolution input for image-text pairs, DetCLIPv2 exploits image-text pair data efficiently and effectively: DetCLIPv2 utilizes 13x more image-text pairs than DetCLIP with a similar training time and improves performance. With 13M image-text pairs for pre-training, DetCLIPv2 demonstrates superior open-vocabulary detection performance, e.g., DetCLIPv2 with a Swin-T backbone achieves 40.4% zero-shot AP on the LVIS benchmark, which outperforms previous works GLIP/GLIPv2/DetCLIP by 14.4/11.4/4.5% AP, respectively, and even beats its fully-supervised counterpart by a large margin.
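To make the "maximum word-region similarity" idea concrete, the following is a minimal NumPy sketch, not the authors' implementation: for each word embedding, the score takes its maximum cosine similarity over all region-proposal embeddings, averages over words to obtain an image-text alignment score, and plugs that score into an InfoNCE-style contrastive loss over a batch. All function names, shapes, and the temperature value are illustrative assumptions.

```python
import numpy as np

def word_region_similarity(word_emb, region_emb):
    """Hypothetical word-region score: for each word, take the max
    cosine similarity over all region proposals, then average over
    words to get one image-text alignment score."""
    w = word_emb / np.linalg.norm(word_emb, axis=-1, keepdims=True)
    r = region_emb / np.linalg.norm(region_emb, axis=-1, keepdims=True)
    sim = w @ r.T                      # (num_words, num_regions)
    return sim.max(axis=1).mean()      # max over regions, mean over words

def contrastive_loss(word_embs, region_embs, temperature=0.07):
    """InfoNCE-style loss over a batch of (caption, image) pairs,
    using the word-region score as the pairwise similarity; matched
    pairs sit on the diagonal of the logits matrix."""
    n = len(word_embs)
    logits = np.array([[word_region_similarity(word_embs[i], region_embs[j])
                        for j in range(n)] for i in range(n)]) / temperature
    # cross-entropy with the matched pair as the positive class
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

Because the per-word maximum ignores unmatched regions, captions are only pulled toward the regions that best explain each word, which is what lets the alignment stay fine-grained without box-level labels on the image-text data.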

Cite

Text

Yao et al. "DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-Training via Word-Region Alignment." Conference on Computer Vision and Pattern Recognition, 2023. doi:10.1109/CVPR52729.2023.02250

Markdown

[Yao et al. "DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-Training via Word-Region Alignment." Conference on Computer Vision and Pattern Recognition, 2023.](https://mlanthology.org/cvpr/2023/yao2023cvpr-detclipv2/) doi:10.1109/CVPR52729.2023.02250

BibTeX

@inproceedings{yao2023cvpr-detclipv2,
  title     = {{DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-Training via Word-Region Alignment}},
  author    = {Yao, Lewei and Han, Jianhua and Liang, Xiaodan and Xu, Dan and Zhang, Wei and Li, Zhenguo and Xu, Hang},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2023},
  pages     = {23497-23506},
  doi       = {10.1109/CVPR52729.2023.02250},
  url       = {https://mlanthology.org/cvpr/2023/yao2023cvpr-detclipv2/}
}