Masked Scene Modeling: Narrowing the Gap Between Supervised and Self-Supervised Learning in 3D Scene Understanding

Abstract

Self-supervised learning has transformed 2D computer vision by enabling models trained on large, unannotated datasets to provide versatile off-the-shelf features that perform similarly to models trained with labels. However, in 3D scene understanding, self-supervised methods are typically only used as a weight initialization step for task-specific fine-tuning, limiting their utility for general-purpose feature extraction. This paper aims to address this shortcoming by proposing a robust evaluation protocol specifically designed to assess the quality of self-supervised features for 3D scene understanding. Our protocol uses multi-resolution feature sampling of hierarchical models to create rich point-level representations that capture the semantic capabilities of the model and, hence, are suitable for evaluation with linear probing and nearest-neighbor methods. Furthermore, we introduce the first self-supervised model that performs similarly to supervised models when only off-the-shelf features are used in a linear probing setup. In particular, our model is trained natively in 3D with a novel self-supervised approach based on a Masked Scene Modeling objective, which reconstructs deep features of masked patches in a bottom-up manner and is specifically tailored to hierarchical 3D models. Our experiments demonstrate not only that our method achieves performance competitive with supervised models, but also that it surpasses existing self-supervised approaches by a large margin.
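
The evaluation idea described in the abstract (sampling features from several levels of a hierarchical model, concatenating them per point, and scoring them with a nearest-neighbor method) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the brute-force nearest-neighbor lookup used to map points to coarser levels, and the toy data are all assumptions made for illustration, and the paper may use a different interpolation scheme.

```python
import math


def nearest_index(query, points):
    """Return the index of the entry in `points` closest to `query` (brute force)."""
    return min(range(len(points)), key=lambda i: math.dist(query, points[i]))


def sample_multires_features(points, levels):
    """Build one feature vector per input point by concatenating, for each
    hierarchy level, the feature of the nearest level point.

    `levels` is a list of (level_points, level_features) pairs ordered from
    fine to coarse. Nearest-neighbor lookup is an illustrative assumption.
    """
    sampled = []
    for p in points:
        feat = []
        for level_points, level_feats in levels:
            feat.extend(level_feats[nearest_index(p, level_points)])
        sampled.append(feat)
    return sampled


def nn_classify(query_feats, bank_feats, bank_labels):
    """Label each query with the label of its nearest neighbor in feature space."""
    return [bank_labels[nearest_index(q, bank_feats)] for q in query_feats]


# Toy scene: three points, a fine level (one feature per point) and a
# coarse level with two cells. All values are made up.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
fine = (points, [[1.0], [2.0], [3.0]])
coarse = ([(0.5, 0.0, 0.0), (4.0, 0.0, 0.0)], [[10.0, 0.0], [0.0, 10.0]])

feats = sample_multires_features(points, [fine, coarse])
labels = nn_classify([[2.5, 0.0, 9.0]], feats, ["floor", "floor", "wall"])
```

In this toy setup each point ends up with a three-dimensional feature (one fine channel plus two coarse channels). A linear probe would replace `nn_classify` with a single trainable linear layer on the same concatenated features.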

Cite

Text

Hermosilla et al. "Masked Scene Modeling: Narrowing the Gap Between Supervised and Self-Supervised Learning in 3D Scene Understanding." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.01382

Markdown

[Hermosilla et al. "Masked Scene Modeling: Narrowing the Gap Between Supervised and Self-Supervised Learning in 3D Scene Understanding." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/hermosilla2025cvpr-masked/) doi:10.1109/CVPR52734.2025.01382

BibTeX

@inproceedings{hermosilla2025cvpr-masked,
  title     = {{Masked Scene Modeling: Narrowing the Gap Between Supervised and Self-Supervised Learning in 3D Scene Understanding}},
  author    = {Hermosilla, Pedro and Stippel, Christian and Sick, Leon},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {14835--14844},
  doi       = {10.1109/CVPR52734.2025.01382},
  url       = {https://mlanthology.org/cvpr/2025/hermosilla2025cvpr-masked/}
}