LLMDet: Learning Strong Open-Vocabulary Object Detectors Under the Supervision of Large Language Models

Abstract

Recent open-vocabulary detectors achieve promising performance with abundant region-level annotated data. In this work, we show that co-training an open-vocabulary detector with a large language model that generates detailed image-level captions for each image can further improve performance. To achieve this goal, we first collect a dataset, GroundingCap-1M, in which each image is paired with its grounding labels and a detailed image-level caption. With this dataset, we fine-tune an open-vocabulary detector with training objectives that include a standard grounding loss and a caption-generation loss. We leverage a large language model to generate both short region-level captions for each region of interest and long image-level captions for the whole image. Under the supervision of the large language model, the resulting detector, LLMDet, outperforms the baseline by a clear margin and enjoys superior open-vocabulary ability. Further, we show that the improved LLMDet can in turn be used to build a stronger large multi-modal model, achieving mutual benefits. The code, model, and dataset will be available.
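The combined objective described in the abstract can be sketched as a weighted sum of the grounding loss and the caption-generation loss. The sketch below is a minimal, self-contained illustration, not the paper's implementation: the cross-entropy stand-ins, the averaging over caption tokens, and the weight `lam` are all assumptions for clarity.

```python
import math

def cross_entropy(probs, target_idx):
    """Negative log-likelihood of the target class (a stand-in for the
    detector's grounding loss and the LLM's next-token loss)."""
    return -math.log(probs[target_idx])

def co_training_loss(grounding_probs, grounding_tgt,
                     caption_token_probs, caption_tgts, lam=1.0):
    """Hypothetical combined objective: standard grounding loss plus a
    caption-generation loss, balanced by a weight `lam` (an assumption;
    the paper's actual loss terms and weighting may differ)."""
    # Grounding term: classification of one region of interest.
    l_ground = cross_entropy(grounding_probs, grounding_tgt)
    # Caption term: average next-token loss over the generated caption.
    l_caption = sum(
        cross_entropy(p, t)
        for p, t in zip(caption_token_probs, caption_tgts)
    ) / len(caption_tgts)
    return l_ground + lam * l_caption
```

For example, with a region classified correctly at probability 0.7 and a one-token caption predicted at probability 0.5, the combined loss with `lam=1.0` is `-log(0.7) - log(0.5)`.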

Cite

Text

Fu et al. "LLMDet: Learning Strong Open-Vocabulary Object Detectors Under the Supervision of Large Language Models." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.01396

Markdown

[Fu et al. "LLMDet: Learning Strong Open-Vocabulary Object Detectors Under the Supervision of Large Language Models." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/fu2025cvpr-llmdet/) doi:10.1109/CVPR52734.2025.01396

BibTeX

@inproceedings{fu2025cvpr-llmdet,
  title     = {{LLMDet: Learning Strong Open-Vocabulary Object Detectors Under the Supervision of Large Language Models}},
  author    = {Fu, Shenghao and Yang, Qize and Mo, Qijie and Yan, Junkai and Wei, Xihan and Meng, Jingke and Xie, Xiaohua and Zheng, Wei-Shi},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {14987--14997},
  doi       = {10.1109/CVPR52734.2025.01396},
  url       = {https://mlanthology.org/cvpr/2025/fu2025cvpr-llmdet/}
}