LLMDet: Learning Strong Open-Vocabulary Object Detectors Under the Supervision of Large Language Models
Abstract
Recent open-vocabulary detectors achieve promising performance with abundant region-level annotated data. In this work, we show that co-training an open-vocabulary detector with a large language model that generates image-level detailed captions for each image can further improve performance. To this end, we first collect a dataset, GroundingCap-1M, wherein each image is accompanied by associated grounding labels and an image-level detailed caption. With this dataset, we fine-tune an open-vocabulary detector with training objectives that include a standard grounding loss and a caption generation loss. We take advantage of a large language model to generate both region-level short captions for each region of interest and image-level long captions for the whole image. Under the supervision of the large language model, the resulting detector, LLMDet, outperforms the baseline by a clear margin, enjoying superior open-vocabulary ability. Further, we show that the improved LLMDet can in turn build a stronger large multi-modal model, achieving mutual benefits. The code, model, and dataset will be available.
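The abstract describes combining a standard grounding loss with a caption-generation loss. A minimal sketch of such a combined objective is shown below; the function name `combined_loss`, the per-token negative log-likelihood input, and the `caption_weight` hyperparameter are all illustrative assumptions, not the paper's actual formulation.

```python
def combined_loss(grounding_loss, caption_nll_per_token, caption_weight=1.0):
    """Hypothetical combined training objective (illustrative only):
    a standard grounding loss plus a caption-generation term, where the
    caption term is the mean per-token negative log-likelihood produced
    by the language-model head.
    """
    # Average the per-token NLL to get a length-normalized caption loss.
    caption_loss = sum(caption_nll_per_token) / max(len(caption_nll_per_token), 1)
    # Weighted sum of the two objectives.
    return grounding_loss + caption_weight * caption_loss
```

For example, `combined_loss(1.0, [2.0, 4.0], caption_weight=0.5)` returns `2.5`: the grounding loss of 1.0 plus half of the mean caption NLL of 3.0.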
Cite
Text
Fu et al. "LLMDet: Learning Strong Open-Vocabulary Object Detectors Under the Supervision of Large Language Models." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.01396
Markdown
[Fu et al. "LLMDet: Learning Strong Open-Vocabulary Object Detectors Under the Supervision of Large Language Models." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/fu2025cvpr-llmdet/) doi:10.1109/CVPR52734.2025.01396
BibTeX
@inproceedings{fu2025cvpr-llmdet,
title = {{LLMDet: Learning Strong Open-Vocabulary Object Detectors Under the Supervision of Large Language Models}},
author = {Fu, Shenghao and Yang, Qize and Mo, Qijie and Yan, Junkai and Wei, Xihan and Meng, Jingke and Xie, Xiaohua and Zheng, Wei-Shi},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2025},
pages = {14987--14997},
doi = {10.1109/CVPR52734.2025.01396},
url = {https://mlanthology.org/cvpr/2025/fu2025cvpr-llmdet/}
}