OVSeg3R: Learn Open-Vocabulary Instance Segmentation from 2D via 3D Reconstruction

Abstract

In this paper, we propose OVSeg3R, a training scheme that learns open-vocabulary 3D instance segmentation from well-studied 2D perception models with the aid of 3D reconstruction. OVSeg3R takes scenes reconstructed from 2D videos directly as input, avoiding costly manual adjustment while aligning the input with real-world applications. By exploiting the 2D-to-3D correspondences provided by 3D reconstruction models, OVSeg3R projects each view's 2D instance mask predictions, obtained from an open-vocabulary 2D model, onto 3D to generate annotations for the view's corresponding sub-scene. Because annotations lifted from 2D to 3D are only partial, predictions can be wrongly treated as false positives during supervision. To avoid this, we propose a View-wise Instance Partition algorithm, which partitions predictions to their respective views for supervision, stabilizing the training process. Furthermore, since 3D reconstruction models tend to over-smooth geometric details, clustering reconstructed points into representative superpoints based solely on geometry, as commonly done in mainstream 3D segmentation methods, may overlook geometrically non-salient objects. We therefore introduce the 2D Instance Boundary-aware Superpoint, which leverages 2D masks to constrain the superpoint clustering, preventing superpoints from crossing instance boundaries. With these designs, OVSeg3R not only extends a state-of-the-art closed-vocabulary 3D instance segmentation model to the open-vocabulary setting, but also substantially narrows the performance gap between tail and head classes, leading to an overall improvement of 2.3 mAP on the ScanNet200 benchmark. Moreover, under the standard open-vocabulary setting, OVSeg3R surpasses previous methods by about 7.1 mAP on the novel classes, further validating its effectiveness.
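
As a rough illustration of two ideas in the abstract, the NumPy sketch below lifts one view's 2D instance masks onto reconstructed 3D points through a pixel-to-point correspondence map, then splits geometric superpoints so that none spans more than one lifted instance. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the array layouts, and the `-1` convention for "no correspondence" or "background" are all hypothetical, and the plain split used here stands in for the constrained clustering the abstract describes.

```python
import numpy as np


def lift_masks_to_points(masks, pixel_to_point, num_points):
    """Lift 2D instance masks of one view onto 3D points.

    masks:          (K, H, W) bool, one 2D mask per instance (assumed layout).
    pixel_to_point: (H, W) int, index of the 3D point seen at each pixel,
                    or -1 where the reconstruction gives no correspondence.
    Returns a (K, num_points) bool array of per-instance point masks.
    """
    valid = pixel_to_point >= 0
    point_masks = np.zeros((masks.shape[0], num_points), dtype=bool)
    for k, mask in enumerate(masks):
        # Keep only pixels that are inside the mask and have a 3D point.
        point_masks[k, pixel_to_point[mask & valid]] = True
    return point_masks


def split_superpoints_by_instance(superpoint_ids, point_instance):
    """Split superpoints so each one stays inside a single instance.

    superpoint_ids: (N,) int, geometry-only superpoint id per point.
    point_instance: (N,) int, lifted instance id per point (-1 = background).
    Returns (N,) int ids, one per distinct (superpoint, instance) pair.
    """
    pairs = np.stack([superpoint_ids, point_instance], axis=1)
    _, new_ids = np.unique(pairs, axis=0, return_inverse=True)
    return new_ids.reshape(-1)


# Toy view: 2x3 pixels, 5 reconstructed points, 2 instances.
pixel_to_point = np.array([[0, 1, 2],
                           [3, -1, 4]])
masks = np.array([[[1, 1, 0], [0, 0, 0]],
                  [[0, 0, 1], [0, 1, 1]]], dtype=bool)
point_masks = lift_masks_to_points(masks, pixel_to_point, num_points=5)

# Per-point instance id: index of the covering mask, or -1 if uncovered.
point_instance = np.where(point_masks.any(axis=0),
                          point_masks.argmax(axis=0), -1)

# Geometry alone merged points 0-2, although they cover two instances.
superpoint_ids = np.array([0, 0, 0, 1, 1])
refined = split_superpoints_by_instance(superpoint_ids, point_instance)
```

In the toy example, the first geometric superpoint covers points from both instances, so the split separates it along the lifted mask boundary. Points that a view does not cover stay unlabeled, which is the partial-annotation situation the View-wise Instance Partition is meant to handle.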

Cite

Text

Li et al. "OVSeg3R: Learn Open-Vocabulary Instance Segmentation from 2D via 3D Reconstruction." International Conference on Learning Representations, 2026.

Markdown

[Li et al. "OVSeg3R: Learn Open-Vocabulary Instance Segmentation from 2D via 3D Reconstruction." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/li2026iclr-ovseg3r/)

BibTeX

@inproceedings{li2026iclr-ovseg3r,
  title     = {{OVSeg3R: Learn Open-Vocabulary Instance Segmentation from 2D via 3D Reconstruction}},
  author    = {Li, Hongyang and Qu, Jinyuan and Zhang, Lei},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/li2026iclr-ovseg3r/}
}