Language Embedded 3D Gaussians for Open-Vocabulary Scene Understanding

Abstract

Open-vocabulary querying in 3D space is challenging but essential for scene understanding tasks such as object localization and segmentation. Language-embedded scene representations have made progress by incorporating language features into 3D spaces. However, their efficacy heavily depends on neural networks that are resource-intensive in training and rendering. Although recent 3D Gaussians offer efficient and high-quality novel view synthesis, directly embedding language features in them leads to prohibitive memory usage and decreased performance. In this work, we introduce Language Embedded 3D Gaussians, a novel scene representation for open-vocabulary query tasks. Instead of embedding high-dimensional raw semantic features on 3D Gaussians, we propose a dedicated quantization scheme that drastically alleviates the memory requirement, and a novel embedding procedure that achieves smoother yet high-accuracy queries, countering the multi-view feature inconsistencies and the high-frequency inductive bias in point-based representations. Our comprehensive experiments show that our representation achieves the best visual quality and language querying accuracy across current language-embedded representations, while maintaining real-time rendering frame rates on a single desktop GPU.

Cite

Text

Shi et al. "Language Embedded 3D Gaussians for Open-Vocabulary Scene Understanding." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.00510

Markdown

[Shi et al. "Language Embedded 3D Gaussians for Open-Vocabulary Scene Understanding." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/shi2024cvpr-language/) doi:10.1109/CVPR52733.2024.00510

BibTeX

@inproceedings{shi2024cvpr-language,
  title     = {{Language Embedded 3D Gaussians for Open-Vocabulary Scene Understanding}},
  author    = {Shi, Jin-Chuan and Wang, Miao and Duan, Hao-Bin and Guan, Shao-Hua},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {5333-5343},
  doi       = {10.1109/CVPR52733.2024.00510},
  url       = {https://mlanthology.org/cvpr/2024/shi2024cvpr-language/}
}