Cross-Modal and Uncertainty-Aware Agglomeration for Open-Vocabulary 3D Scene Understanding
Abstract
The lack of a large-scale 3D-text corpus has led recent works to distill open-vocabulary knowledge from vision-language models (VLMs). However, these methods typically rely on a single VLM to align the feature space of 3D models within a common language space, which limits the potential of 3D models to leverage the diverse spatial and semantic capabilities encapsulated in various foundation models. In this paper, we propose Cross-modal and Uncertainty-aware Agglomeration for Open-vocabulary 3D Scene Understanding, dubbed CUA-O3D, the first model to integrate multiple foundation models, such as CLIP, DINOv2, and Stable Diffusion, into 3D scene understanding. We further introduce a deterministic uncertainty estimation to adaptively distill and harmonize the heterogeneous 2D feature embeddings from these models. Our method addresses two key challenges: (1) incorporating semantic priors from VLMs alongside the geometric knowledge of spatially-aware vision foundation models, and (2) using a novel deterministic uncertainty estimation to capture model-specific uncertainties across diverse semantic and geometric sensitivities, helping to reconcile heterogeneous representations during training. Extensive experiments on ScanNetV2 and Matterport3D demonstrate that our method not only advances open-vocabulary segmentation but also achieves robust cross-domain alignment and competitive spatial perception capabilities.
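The abstract does not specify how the deterministic uncertainty estimation weights each teacher's distillation signal. One common pattern for combining heterogeneous teachers is heteroscedastic-style weighting, where a predicted per-point log-variance down-weights noisier targets (in the spirit of Kendall and Gal). The sketch below is a hypothetical illustration of that idea, not the paper's actual formulation; the function name, dictionary layout, and loss form are all assumptions.

```python
import numpy as np

def uncertainty_weighted_distill_loss(pred_feats, teacher_feats, log_vars):
    """Hypothetical multi-teacher distillation loss with uncertainty weighting.

    pred_feats    : dict teacher_name -> (N, D) 3D-model features per teacher head
    teacher_feats : dict teacher_name -> (N, D) lifted 2D features from that teacher
    log_vars      : dict teacher_name -> (N,) predicted per-point log-variance

    Each teacher's squared error is scaled by exp(-log_var) so uncertain
    targets contribute less, with log_var added back as a regularizer to
    prevent the model from inflating uncertainty everywhere.
    """
    total = 0.0
    for name, pred in pred_feats.items():
        target = teacher_feats[name]
        lv = log_vars[name]
        # Per-point mean squared error between 3D features and 2D targets.
        sq_err = np.mean((pred - target) ** 2, axis=-1)
        # Precision-weighted error plus log-variance penalty, averaged over points.
        total += np.mean(np.exp(-lv) * sq_err + lv)
    return total

# Toy usage with three hypothetical teacher heads (CLIP, DINOv2, Stable Diffusion).
rng = np.random.default_rng(0)
preds = {t: rng.normal(size=(8, 16)) for t in ("clip", "dinov2", "sd")}
targets = {t: rng.normal(size=(8, 16)) for t in ("clip", "dinov2", "sd")}
logvars = {t: np.zeros(8) for t in ("clip", "dinov2", "sd")}
loss = uncertainty_weighted_distill_loss(preds, targets, logvars)
```

With all log-variances at zero the loss reduces to a plain sum of per-teacher MSEs; raising a teacher's log-variance shrinks that teacher's contribution, which is the intuition behind adaptively harmonizing heterogeneous 2D embeddings.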
Cite
Text
Li et al. "Cross-Modal and Uncertainty-Aware Agglomeration for Open-Vocabulary 3D Scene Understanding." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.01806
Markdown
[Li et al. "Cross-Modal and Uncertainty-Aware Agglomeration for Open-Vocabulary 3D Scene Understanding." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/li2025cvpr-crossmodal/) doi:10.1109/CVPR52734.2025.01806
BibTeX
@inproceedings{li2025cvpr-crossmodal,
title = {{Cross-Modal and Uncertainty-Aware Agglomeration for Open-Vocabulary 3D Scene Understanding}},
author = {Li, Jinlong and Saltori, Cristiano and Poiesi, Fabio and Sebe, Nicu},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2025},
pages = {19390--19400},
doi = {10.1109/CVPR52734.2025.01806},
url = {https://mlanthology.org/cvpr/2025/li2025cvpr-crossmodal/}
}