ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations

Abstract

Referring video object segmentation (RVOS) aims to segment target objects throughout a video based on a text description. This is challenging as it involves deep vision-language understanding, pixel-level dense prediction and spatiotemporal reasoning. Despite notable progress in recent years, existing methods still exhibit a noticeable gap when considering all these aspects. In this work, we propose ReferDINO, a strong RVOS model that inherits region-level vision-language alignment from foundational visual grounding models, and is further endowed with pixel-level dense perception and cross-modal spatiotemporal reasoning. In detail, ReferDINO integrates two key components: 1) a grounding-guided deformable mask decoder that utilizes location prediction to progressively guide mask prediction through differentiable deformation mechanisms; 2) an object-consistent temporal enhancer that injects pretrained time-varying text features into inter-frame interaction to capture object-aware dynamic changes. Moreover, a confidence-aware query pruning strategy is designed to accelerate object decoding without compromising model performance. Extensive experimental results on five benchmarks demonstrate that our ReferDINO significantly outperforms previous methods (e.g., +3.9% $\mathcal{J}\&\mathcal{F}$ on Ref-YouTube-VOS) with real-time inference speed (51 FPS).
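The abstract names a confidence-aware query pruning strategy but does not spell out its rule. The following is only a minimal sketch of the general idea, assuming that each object query carries a scalar confidence and that a fixed fraction of the lowest-scoring queries is dropped after every decoder layer. The function names (`prune_queries`, `decode_with_pruning`), the `keep_ratio` parameter and the `score_fn` callback are hypothetical and are not taken from the ReferDINO paper or code.

```python
from typing import Callable, List, Sequence, Tuple

Query = Sequence[float]


def prune_queries(
    queries: Sequence[Query],
    confidences: Sequence[float],
    keep_ratio: float = 0.5,
    min_keep: int = 1,
) -> Tuple[List[Query], List[int]]:
    """Keep the most confident queries (illustrative rule, an assumption)."""
    if len(queries) != len(confidences):
        raise ValueError("queries and confidences must have the same length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    # Number of survivors: a fixed fraction, but never fewer than min_keep
    # and never more than the number of queries available.
    n_keep = min(len(queries), max(min_keep, int(len(queries) * keep_ratio)))
    # Rank indices by confidence, highest first (sorted() is stable on ties).
    ranked = sorted(range(len(queries)), key=lambda i: -confidences[i])
    kept = sorted(ranked[:n_keep])  # restore the original query order
    return [queries[i] for i in kept], kept


def decode_with_pruning(
    queries: Sequence[Query],
    score_fn: Callable[[Query], float],
    num_layers: int = 3,
    keep_ratio: float = 0.5,
) -> Tuple[List[Query], List[int]]:
    """Stand-in decoder loop: score, then prune, once per layer.

    A real decoder would also update the queries in each layer; that step
    is omitted here so the pruning logic stays visible.
    """
    alive = list(range(len(queries)))  # indices into the original query set
    current = list(queries)
    for _ in range(num_layers):
        confidences = [score_fn(q) for q in current]
        current, kept = prune_queries(current, confidences, keep_ratio)
        alive = [alive[i] for i in kept]
    return current, alive


if __name__ == "__main__":
    toy_queries = [[0.1], [0.9], [0.5], [0.7]]
    survivors, indices = decode_with_pruning(
        toy_queries, score_fn=lambda q: q[0], num_layers=2, keep_ratio=0.5
    )
    print(survivors, indices)
```

Because later layers only process the surviving queries, the per-layer cost shrinks as decoding proceeds, which is the kind of saving the abstract attributes to pruning. How ReferDINO actually scores and selects queries may differ from this sketch.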

Cite

Text

Liang et al. "ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations." International Conference on Computer Vision, 2025.

Markdown

[Liang et al. "ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/liang2025iccv-referdino/)

BibTeX

@inproceedings{liang2025iccv-referdino,
  title     = {{ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations}},
  author    = {Liang, Tianming and Lin, Kun-Yu and Tan, Chaolei and Zhang, Jianguo and Zheng, Wei-Shi and Hu, Jian-Fang},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {20009--20019},
  url       = {https://mlanthology.org/iccv/2025/liang2025iccv-referdino/}
}