OV3D-CG: Open-Vocabulary 3D Instance Segmentation with Contextual Guidance

Abstract

Open-vocabulary 3D instance segmentation (OV-3DIS), which aims to segment and classify objects beyond predefined categories, is a critical capability for embodied AI applications. Existing methods rely on pre-trained 2D foundation models and focus on instance-level features while overlooking contextual relationships, which limits their ability to generalize to rare or ambiguous objects. To address these limitations, we propose an OV-3DIS framework guided by contextual information. First, we employ a Class-agnostic Proposal Module, which integrates a pre-trained 3D segmentation model with a SAM-guided segmenter to extract robust 3D instance masks. Next, we design a Semantic Reasoning Module, which selects the best viewpoint for each instance and constructs three 2D context-aware representations. These representations are processed by Multimodal Large Language Models with Chain-of-Thought prompting to enhance semantic inference. Our method outperforms state-of-the-art methods on the ScanNet200 and Replica datasets, demonstrating superior open-vocabulary segmentation capabilities. Moreover, a preliminary implementation in real-world scenarios verifies our method's robustness and accuracy, highlighting its potential for embodied AI tasks such as object-driven navigation. Our project page is at: https://vipl-vsu.github.io/OV3D-CG/.
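
The abstract does not say how the best viewpoint for each instance is chosen. As an illustration only, the sketch below assumes one plausible criterion: pick the camera in which the most instance points project inside the image with positive depth. The function names, the pinhole camera model, and the absence of occlusion handling are all assumptions made for this example, not details taken from the paper.

```python
import numpy as np


def count_visible_points(points_world, world_to_cam, K, width, height):
    """Count points that land inside the image with positive depth.

    Hypothetical helper: assumes a pinhole camera and ignores occlusion.
    """
    n = points_world.shape[0]
    if n == 0:
        return 0
    # Lift to homogeneous coordinates and move into the camera frame.
    homo = np.hstack([points_world, np.ones((n, 1))])
    cam = (world_to_cam @ homo.T).T[:, :3]
    # Keep only points in front of the camera.
    cam = cam[cam[:, 2] > 1e-6]
    if cam.shape[0] == 0:
        return 0
    # Perspective projection to pixel coordinates.
    pix = (K @ cam.T).T
    u = pix[:, 0] / pix[:, 2]
    v = pix[:, 1] / pix[:, 2]
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    return int(inside.sum())


def select_best_view(points_world, instance_mask, world_to_cams, K, width, height):
    """Return (best_view_index, per_view_counts) for one 3D instance mask.

    Assumed criterion: the view with the most in-frame instance points wins.
    """
    instance_points = points_world[instance_mask]
    counts = [
        count_visible_points(instance_points, T, K, width, height)
        for T in world_to_cams
    ]
    return int(np.argmax(counts)), counts
```

In a fuller pipeline, a depth-map check would typically be added so that occluded points are not counted; it is omitted here to keep the sketch short.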

Cite

Text

Zhou et al. "OV3D-CG: Open-Vocabulary 3D Instance Segmentation with Contextual Guidance." International Conference on Computer Vision, 2025.

Markdown

[Zhou et al. "OV3D-CG: Open-Vocabulary 3D Instance Segmentation with Contextual Guidance." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/zhou2025iccv-ov3dcg/)

BibTeX

@inproceedings{zhou2025iccv-ov3dcg,
  title     = {{OV3D-CG: Open-Vocabulary 3D Instance Segmentation with Contextual Guidance}},
  author    = {Zhou, Mingquan and He, Chen and Wang, Ruiping and Chen, Xilin},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {5305--5314},
  url       = {https://mlanthology.org/iccv/2025/zhou2025iccv-ov3dcg/}
}