OpenScene: 3D Scene Understanding with Open Vocabularies

Abstract

Traditional 3D scene understanding approaches rely on labeled 3D datasets to train a model for a single task with supervision. We propose OpenScene, an alternative approach where a model predicts dense features for 3D scene points that are co-embedded with text and image pixels in CLIP feature space. This zero-shot approach enables task-agnostic training and open-vocabulary queries. For example, to perform SOTA zero-shot 3D semantic segmentation, it first infers CLIP features for every 3D point and later classifies them based on similarities to embeddings of arbitrary class labels. More interestingly, it enables a suite of open-vocabulary scene understanding applications that have never been done before. For example, it allows a user to enter an arbitrary text query and then see a heat map indicating which parts of a scene match. Our approach is effective at identifying objects, materials, affordances, activities, and room types in complex 3D scenes, all using a single model trained without any labeled 3D data.
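
The abstract describes the core inference step: per-point features live in CLIP space, so both zero-shot segmentation and open-vocabulary queries reduce to cosine similarity against CLIP text embeddings. The following is a minimal sketch of that step, not the authors' released code; it assumes a hypothetical tensor point_feats of CLIP-aligned per-point features produced by an OpenScene-style 3D network, and uses OpenAI's CLIP package only for the text encoder.

# Minimal sketch of open-vocabulary point classification and querying.
# Assumes `point_feats` (N x 768) are CLIP-co-embedded per-point features
# from an OpenScene-style 3D backbone; CLIP supplies only text embeddings.
import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-L/14", device=device)  # text encoder for labels/queries

def embed_text(prompts):
    """Return L2-normalized CLIP text embeddings for a list of strings."""
    with torch.no_grad():
        feats = model.encode_text(clip.tokenize(prompts).to(device)).float()
    return feats / feats.norm(dim=-1, keepdim=True)

# Hypothetical per-point features (stand-in for the 3D network's output).
point_feats = torch.randn(10000, 768, device=device)
point_feats = point_feats / point_feats.norm(dim=-1, keepdim=True)

# Zero-shot semantic segmentation: assign each point its most similar label.
class_names = ["a wall", "a chair", "a sofa", "a table", "a door"]
class_emb = embed_text(class_names)
labels = (point_feats @ class_emb.T).argmax(dim=-1)   # (N,) class indices

# Open-vocabulary query: cosine similarity gives a per-point heat map.
query_emb = embed_text(["somewhere to sit"])
heat_map = (point_feats @ query_emb.T).squeeze(-1)    # (N,) relevance scores

Because the point features and text embeddings share one space, the same similarity computation serves any query string, which is what makes the single task-agnostic model sufficient for objects, materials, affordances, activities, and room types.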

Cite

Text

Peng et al. "OpenScene: 3D Scene Understanding with Open Vocabularies." Conference on Computer Vision and Pattern Recognition, 2023. doi:10.1109/CVPR52729.2023.00085

Markdown

[Peng et al. "OpenScene: 3D Scene Understanding with Open Vocabularies." Conference on Computer Vision and Pattern Recognition, 2023.](https://mlanthology.org/cvpr/2023/peng2023cvpr-openscene/) doi:10.1109/CVPR52729.2023.00085

BibTeX

@inproceedings{peng2023cvpr-openscene,
  title     = {{OpenScene: 3D Scene Understanding with Open Vocabularies}},
  author    = {Peng, Songyou and Genova, Kyle and Jiang, Chiyu "Max" and Tagliasacchi, Andrea and Pollefeys, Marc and Funkhouser, Thomas},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2023},
  pages     = {815--824},
  doi       = {10.1109/CVPR52729.2023.00085},
  url       = {https://mlanthology.org/cvpr/2023/peng2023cvpr-openscene/}
}