DoGA: Enhancing Grounded Object Detection via Grouped Pre-Training with Attributes
Abstract
Recent advances in vision-language pre-training have significantly improved model capabilities in grounded object detection. However, these studies often pre-train with coarse-grained text prompts, such as plain category names and brief grounded phrases. This limitation curtails the model's capacity for fine-grained linguistic comprehension and leads to a significant decline in performance when faced with detailed descriptions or contextual information. To tackle these problems, we develop DoGA: Detect objects with Grouped Attributes, which employs commonly apparent attributes to bridge semantics at different granularities and uses specific attributes to identify discrepancies between objects. DoGA incorporates three principal components: 1) Generation of attribute-based prompts, consisting of linguistic definitions enriched with common-sense visible attributes and hard negative notations derived from image-specific attribute features; 2) Paralleled entity fusion and optimization, designed to manage long attribute-based descriptions and negative concepts efficiently; and 3) Prompt-wise grouped training, which adapts the model to perform many-to-many assignments, facilitating simultaneous training and inference with multiple attribute-based synonyms. Extensive experiments demonstrate that training with synonymous attribute-based prompts allows DoGA to generalize across multi-granular prompts and surpass previous state-of-the-art approaches, achieving 50.2 on COCO and 38.0 on LVIS under the zero-shot setting. We will make our code publicly available upon acceptance.
Cite
Text
Liu et al. "DoGA: Enhancing Grounded Object Detection via Grouped Pre-Training with Attributes." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I6.32603
Markdown
[Liu et al. "DoGA: Enhancing Grounded Object Detection via Grouped Pre-Training with Attributes." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/liu2025aaai-doga/) doi:10.1609/AAAI.V39I6.32603
BibTeX
@inproceedings{liu2025aaai-doga,
title = {{DoGA: Enhancing Grounded Object Detection via Grouped Pre-Training with Attributes}},
author = {Liu, Yang and Hou, Feng and Peng, Yunjie and Zhang, Gangjian and Zhang, Yao and Xie, Dong and Wang, Peng and Zhang, Yang and Tian, Jiang and Shi, Zhongchao and Fan, Jianping and He, Zhiqiang},
booktitle = {AAAI Conference on Artificial Intelligence},
year = {2025},
pages = {5658--5666},
doi = {10.1609/AAAI.V39I6.32603},
url = {https://mlanthology.org/aaai/2025/liu2025aaai-doga/}
}