Text-Guided Visual Prompt DINO for Generic Segmentation

Abstract

Recent advancements in multimodal vision models have highlighted limitations in late-stage feature fusion and suboptimal query selection for hybrid-prompt open-world segmentation, alongside constraints from caption-derived vocabularies. To address these challenges, we propose Prompt-DINO, a text-guided visual Prompt DINO framework featuring three key innovations. First, we introduce an early fusion mechanism that unifies text/visual prompts and backbone features at the initial encoding stage, enabling deeper cross-modal interactions to resolve semantic ambiguities. Second, we design order-aligned query selection for DETR-based architectures, explicitly optimizing the structural alignment between text and visual queries during decoding to enhance semantic-spatial consistency. Third, we develop a generative data engine powered by the Recognize Anything via Prompting (RAP) model, which synthesizes 0.5B diverse training instances through a dual-path cross-verification pipeline, reducing label noise by 80.5% compared to conventional approaches. Extensive experiments demonstrate that Prompt-DINO achieves state-of-the-art performance on open-world detection benchmarks while significantly expanding semantic coverage beyond fixed-vocabulary constraints. Our work establishes a new paradigm for scalable multimodal detection and data generation in open-world scenarios. Data and code are available at https://github.com/WeChatCV/WeVisionOne.
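
The abstract does not spell out how the dual-path cross-verification pipeline works. As a rough illustration of the general idea only, the sketch below assumes two independent labeling paths that each propose a set of labels per region, and keeps a label only when both paths agree. The function name `cross_verify`, the region ids, and the use of set intersection as the agreement rule are all hypothetical choices made for this example, not the authors' implementation.

```python
from typing import Dict, Set


def cross_verify(
    path_a: Dict[str, Set[str]],
    path_b: Dict[str, Set[str]],
) -> Dict[str, Set[str]]:
    """Keep, per region id, only the labels proposed by BOTH paths.

    Assumption: agreement is modeled as plain set intersection; the
    actual verification rule in the paper may differ.
    """
    verified: Dict[str, Set[str]] = {}
    # Only regions seen by both paths can be cross-verified.
    for region_id in path_a.keys() & path_b.keys():
        agreed = path_a[region_id] & path_b[region_id]
        if agreed:  # drop regions where the two paths share no label
            verified[region_id] = agreed
    return verified


# Hypothetical outputs of the two labeling paths.
path_a = {"r1": {"dog", "animal"}, "r2": {"car"}, "r3": {"tree"}}
path_b = {"r1": {"dog"}, "r2": {"truck"}, "r4": {"sky"}}

verified = cross_verify(path_a, path_b)
print(verified)
```

In this toy input only region `r1` survives with the label `dog`: `r2` is dropped because the paths disagree, and `r3` and `r4` are dropped because only one path labeled them. Filtering on agreement trades recall for lower label noise, which is the kind of effect the abstract attributes to its pipeline.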

Cite

Text

Guan et al. "Text-Guided Visual Prompt DINO for Generic Segmentation." International Conference on Computer Vision, 2025.

Markdown

[Guan et al. "Text-Guided Visual Prompt DINO for Generic Segmentation." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/guan2025iccv-textguided/)

BibTeX

@inproceedings{guan2025iccv-textguided,
  title     = {{Text-Guided Visual Prompt DINO for Generic Segmentation}},
  author    = {Guan, Yuchen and Sun, Chong and Fu, Canmiao and Huang, Zhipeng and Yuan, Chun and Li, Chen},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {21288--21298},
  url       = {https://mlanthology.org/iccv/2025/guan2025iccv-textguided/}
}