Ground-V: Teaching VLMs to Ground Complex Instructions in Pixels
Abstract
This work presents a simple yet effective workflow for automatically scaling instruction-following data to elicit pixel-level grounding capabilities of VLMs under complex instructions. In particular, we address five critical real-world challenges in text-instruction-based grounding: hallucinated references, multi-object scenarios, reasoning, multi-granularity, and part-level references. By leveraging knowledge distillation from a pre-trained teacher model, our approach generates high-quality instruction-response pairs linked to existing pixel-level annotations, minimizing the need for costly human annotation. The resulting dataset, Ground-V, captures rich object localization knowledge and nuanced pixel-level referring expressions. Experimental results show that models trained on Ground-V exhibit substantial improvements across diverse grounding tasks. Specifically, incorporating Ground-V during training directly yields an average accuracy boost of 4.4% for LISA and 7.9% for PSALM across six benchmarks on the gIoU metric. It also sets new state-of-the-art results on standard benchmarks such as RefCOCO/+/g. Notably, on gRefCOCO, we achieve an N-Acc of 83.3%, exceeding the previous state-of-the-art by more than 20%.
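The abstract describes the data-generation workflow only at a high level: a teacher model produces instruction-response pairs that are attached to pixel-level masks already present in the source dataset. The sketch below illustrates one plausible shape such a pipeline could take; it is not the authors' implementation, and every name in it (query_teacher, the annotation fields, the challenge prompts) is a hypothetical placeholder.

```python
# Minimal sketch of a distillation-style data-generation loop, assuming a
# generic chat-style teacher callable and COCO-like mask annotations.
# All identifiers here are illustrative, not from the Ground-V codebase.

CHALLENGES = [
    "hallucinated references",   # instructions mentioning absent objects
    "multi-object scenarios",    # one instruction, several target masks
    "reasoning",                 # indirect references requiring inference
    "multi-granularity",         # scene- vs. object-level references
    "part-level references",     # e.g. "the dog's left ear"
]

def build_prompt(objects, challenge):
    """Ask the teacher for an instruction-response pair exercising `challenge`,
    grounded in objects that already carry pixel-level masks."""
    names = ", ".join(o["category"] for o in objects)
    return (
        f"The image contains: {names}. "
        f"Write a user instruction that exercises '{challenge}' and a response "
        f"that refers to the relevant objects by their annotation ids."
    )

def generate_pairs(image_record, query_teacher):
    """Link teacher-generated text to existing mask annotations (no new masks)."""
    pairs = []
    for challenge in CHALLENGES:
        reply = query_teacher(build_prompt(image_record["objects"], challenge))
        pairs.append({
            "image_id": image_record["image_id"],
            "challenge": challenge,
            "instruction": reply["instruction"],
            "response": reply["response"],
            # reuse the dataset's existing pixel-level masks as supervision
            "target_masks": [o["mask_id"] for o in image_record["objects"]],
        })
    return pairs
```

In this sketch the costly step (writing diverse instructions for the five challenge types) is delegated to the teacher, while the pixel-level supervision is reused verbatim from existing annotations, which matches the paper's stated goal of minimizing human annotation.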
Cite
Text
Zong et al. "Ground-V: Teaching VLMs to Ground Complex Instructions in Pixels." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.02294
Markdown
[Zong et al. "Ground-V: Teaching VLMs to Ground Complex Instructions in Pixels." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/zong2025cvpr-groundv/) doi:10.1109/CVPR52734.2025.02294
BibTeX
@inproceedings{zong2025cvpr-groundv,
title = {{Ground-V: Teaching VLMs to Ground Complex Instructions in Pixels}},
author = {Zong, Yongshuo and Zhang, Qin and An, Dongsheng and Li, Zhihua and Xu, Xiang and Xu, Linghan and Tu, Zhuowen and Xing, Yifan and Dabeer, Onkar},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2025},
pages = {24635-24645},
doi = {10.1109/CVPR52734.2025.02294},
url = {https://mlanthology.org/cvpr/2025/zong2025cvpr-groundv/}
}