Mosaic3D: Foundation Dataset and Model for Open-Vocabulary 3D Segmentation
Abstract
We tackle open-vocabulary 3D scene segmentation tasks by introducing a novel data generation pipeline and training framework. Our work targets three essential aspects required for an effective dataset: precise 3D region segmentation, comprehensive textual descriptions, and sufficient dataset scale. By leveraging state-of-the-art open-vocabulary image segmentation models and region-aware vision-language models (VLMs), we develop an automatic pipeline capable of producing high-quality 3D mask-text pairs. Applying this pipeline to multiple 3D scene datasets, we create Mosaic3D-5.6M, a dataset of more than 30K annotated scenes with 5.6M mask-text pairs, significantly larger than existing datasets. Building on these data, we propose Mosaic3D, a 3D visual foundation model (3D-VFM) combining a 3D encoder trained with contrastive learning and a lightweight mask decoder for open-vocabulary 3D semantic and instance segmentation. Our approach achieves state-of-the-art results on open-vocabulary 3D semantic and instance segmentation benchmarks including ScanNet200, Matterport3D, and ScanNet++, with ablation studies validating the effectiveness of our large-scale training data. https://nvlabs.github.io/Mosaic3D/
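As a rough illustration of how a contrastively trained 3D encoder can be queried with arbitrary class names, the sketch below labels each point with the class whose text embedding has the highest cosine similarity to the point's feature. This is a generic open-vocabulary recipe, not necessarily the inference procedure used by Mosaic3D; the function names `normalize` and `assign_labels`, the toy 3-dimensional vectors, and the class list are all hypothetical stand-ins for real encoder outputs.

```python
import math

def normalize(v):
    """Scale a vector to unit length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def assign_labels(point_feats, text_embeds, class_names):
    """Label each point with the class whose text embedding is most similar.

    point_feats: per-point feature vectors (assumed output of a 3D encoder).
    text_embeds: one embedding per class name (assumed output of a text encoder).
    """
    texts = [normalize(t) for t in text_embeds]
    labels = []
    for feat in point_feats:
        f = normalize(feat)
        # Cosine similarity reduces to a dot product of unit vectors.
        sims = [sum(a * b for a, b in zip(f, t)) for t in texts]
        labels.append(class_names[sims.index(max(sims))])
    return labels

# Toy data: real embeddings would have hundreds of dimensions.
class_names = ["chair", "table"]
text_embeds = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
point_feats = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [2.0, 0.5, 0.0]]

labels = assign_labels(point_feats, text_embeds, class_names)
print(labels)  # ['chair', 'table', 'chair']
```

Because the class list is only consumed at query time, new categories can be added by embedding their names, without retraining the point encoder.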
Cite
Text
Lee et al. "Mosaic3D: Foundation Dataset and Model for Open-Vocabulary 3D Segmentation." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.01315
Markdown
[Lee et al. "Mosaic3D: Foundation Dataset and Model for Open-Vocabulary 3D Segmentation." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/lee2025cvpr-mosaic3d/) doi:10.1109/CVPR52734.2025.01315
BibTeX
@inproceedings{lee2025cvpr-mosaic3d,
title = {{Mosaic3D: Foundation Dataset and Model for Open-Vocabulary 3D Segmentation}},
author = {Lee, Junha and Park, Chunghyun and Choe, Jaesung and Wang, Yu-Chiang Frank and Kautz, Jan and Cho, Minsu and Choy, Chris},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2025},
pages = {14089-14101},
doi = {10.1109/CVPR52734.2025.01315},
url = {https://mlanthology.org/cvpr/2025/lee2025cvpr-mosaic3d/}
}