Open-Vocabulary Object Detection upon Frozen Vision and Language Models
Abstract
We present F-VLM, a simple open-vocabulary object detection method built upon Frozen Vision and Language Models. F-VLM simplifies the current multi-stage training pipeline by eliminating the need for knowledge distillation or detection-tailored pretraining. Surprisingly, we observe that a frozen VLM: 1) retains the locality-sensitive features necessary for detection, and 2) is a strong region classifier. We finetune only the detector head and combine the detector and VLM outputs for each region at inference time. F-VLM shows compelling scaling behavior and achieves a +6.5 mask AP improvement over the previous state of the art on novel categories of the LVIS open-vocabulary detection benchmark. In addition, we demonstrate very competitive results on the COCO open-vocabulary detection benchmark and on cross-dataset transfer detection, along with significant training speed-ups and compute savings. Code will be released.
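The abstract mentions combining the detector and VLM outputs for each region at inference time. Below is a minimal sketch of one common way to do such a combination: a weighted geometric mean of the detector's region score and the frozen VLM's region-classification score. The function name, the NumPy formulation, and the weights `alpha` and `beta` are illustrative assumptions, not the paper's actual implementation or tuned values.

```python
import numpy as np

def combine_scores(det_scores, vlm_scores, is_novel, alpha=0.35, beta=0.65):
    """Weighted geometric-mean ensemble of detector and frozen-VLM scores.

    det_scores : per-region scores from the finetuned detector head
    vlm_scores : per-region classification scores from the frozen VLM
    is_novel   : boolean mask; novel categories lean more on the VLM
    alpha/beta : illustrative mixing weights for base/novel categories
    """
    base_combined = det_scores ** (1 - alpha) * vlm_scores ** alpha
    novel_combined = det_scores ** (1 - beta) * vlm_scores ** beta
    return np.where(is_novel, novel_combined, base_combined)

# Example: two regions, the second belonging to a novel category.
det = np.array([0.8, 0.5])
vlm = np.array([0.6, 0.9])
novel = np.array([False, True])
combined = combine_scores(det, vlm, novel)
```

Weighting novel categories more heavily toward the VLM score reflects the observation that the frozen VLM is a strong region classifier, while the detector head has only seen base-category labels during finetuning.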
Cite
Text
Kuo et al. "Open-Vocabulary Object Detection upon Frozen Vision and Language Models." International Conference on Learning Representations, 2023.
Markdown
[Kuo et al. "Open-Vocabulary Object Detection upon Frozen Vision and Language Models." International Conference on Learning Representations, 2023.](https://mlanthology.org/iclr/2023/kuo2023iclr-openvocabulary/)
BibTeX
@inproceedings{kuo2023iclr-openvocabulary,
title = {{Open-Vocabulary Object Detection upon Frozen Vision and Language Models}},
author = {Kuo, Weicheng and Cui, Yin and Gu, Xiuye and Piergiovanni, AJ and Angelova, Anelia},
booktitle = {International Conference on Learning Representations},
year = {2023},
url = {https://mlanthology.org/iclr/2023/kuo2023iclr-openvocabulary/}
}