SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding
Abstract
3D vision-language (3D-VL) grounding, which aims to align language with 3D physical environments, stands as a cornerstone in developing embodied agents. In comparison to recent advancements in the 2D domain, grounding language in 3D scenes faces two significant challenges: (i) the scarcity of paired 3D-VL data to support grounded learning of 3D scenes, especially considering complexities within diverse object configurations, rich attributes, and intricate relationships; and (ii) the absence of a unified learning framework to distill knowledge from grounded 3D data. In this work, we aim to address these major challenges in 3D-VL by examining the potential of systematically upscaling 3D-VL learning in indoor scenes. We introduce the first million-scale 3D-VL dataset, SceneVerse, encompassing indoor scenes and comprising vision-language pairs collected from both human annotations and our scalable scene-graph-based generation approach. We demonstrate that this scaling allows for a unified pre-training framework, Grounded Pre-training for Scenes (GPS), for 3D-VL learning. Through extensive experiments, we showcase the effectiveness of GPS by achieving state-of-the-art performance on existing 3D visual grounding and question-answering benchmarks. We also show that the data scaling effect is not limited to GPS, but is generally beneficial for models on tasks like 3D semantic segmentation. The vast potential of SceneVerse and GPS is unveiled through zero-shot transfer experiments in challenging 3D-VL tasks.
Cite
Text
Jia et al. "SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-72673-6_16
Markdown
[Jia et al. "SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/jia2024eccv-sceneverse/) doi:10.1007/978-3-031-72673-6_16
BibTeX
@inproceedings{jia2024eccv-sceneverse,
title = {{SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding}},
author = {Jia, Baoxiong and Chen, Yixin and Yu, Huangyue and Wang, Yan and Niu, Xuesong and Liu, Tengyu and Li, Qing and Huang, Siyuan},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2024},
doi = {10.1007/978-3-031-72673-6_16},
url = {https://mlanthology.org/eccv/2024/jia2024eccv-sceneverse/}
}