Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

Abstract

In this paper, we develop an open-set object detector, called Grounding DINO, by marrying the Transformer-based detector DINO with grounded pre-training, which can detect arbitrary objects given human inputs such as category names or referring expressions. The key to open-set object detection is introducing language into a closed-set detector for open-set concept generalization. To effectively fuse the language and vision modalities, we conceptually divide a closed-set detector into three phases and propose a tight fusion solution comprising a feature enhancer, language-guided query selection, and a cross-modality decoder. We first pre-train Grounding DINO on large-scale datasets, including object detection data, grounding data, and caption data, and evaluate the model on both open-set object detection and referring object detection benchmarks. Grounding DINO performs remarkably well in all settings, including benchmarks on COCO, LVIS, ODinW, and RefCOCO/+/g. It achieves 52.5 AP on the COCO zero-shot¹ detection benchmark and sets a new record on the ODinW zero-shot benchmark with a mean AP of 26.1. We release checkpoints and inference code at https://github.com/IDEA-Research/GroundingDINO.

¹ In this paper, 'zero-shot' refers to scenarios where the training split of the test dataset is not used during training.

Cite

Text

Liu et al. "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-72970-6_3

Markdown

[Liu et al. "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/liu2024eccv-grounding/) doi:10.1007/978-3-031-72970-6_3

BibTeX

@inproceedings{liu2024eccv-grounding,
  title     = {{Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection}},
  author    = {Liu, Shilong and Zeng, Zhaoyang and Ren, Tianhe and Li, Feng and Zhang, Hao and Yang, Jie and Jiang, Qing and Li, Chunyuan and Yang, Jianwei and Su, Hang and Zhu, Jun and Zhang, Lei},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-72970-6_3},
  url       = {https://mlanthology.org/eccv/2024/liu2024eccv-grounding/}
}