Feed-Forward SceneDINO for Unsupervised Semantic Scene Completion

Abstract

Semantic scene completion (SSC) aims to infer both the 3D geometry and semantics of a scene from single images. In contrast to prior work on SSC that heavily relies on expensive ground-truth annotations, we approach SSC in an unsupervised setting. Our novel method, SceneDINO, adapts techniques from self-supervised representation learning and 2D unsupervised scene understanding to SSC. Our training exclusively utilizes multi-view consistency self-supervision without any form of semantic or geometric ground truth. Given a single input image, SceneDINO infers the 3D geometry and expressive 3D DINO features in a feed-forward manner. Through a novel 3D feature distillation approach, we obtain unsupervised 3D semantics. In both 3D and 2D unsupervised scene understanding, SceneDINO reaches state-of-the-art segmentation accuracy. Linear probing of our 3D features matches the segmentation accuracy of a current supervised SSC approach. Additionally, we showcase the domain generalization and multi-view consistency of SceneDINO, taking the first steps towards a strong foundation for single-image 3D scene understanding.

Cite

Text

Jevtić et al. "Feed-Forward SceneDINO for Unsupervised Semantic Scene Completion." International Conference on Computer Vision, 2025.

Markdown

[Jevtić et al. "Feed-Forward SceneDINO for Unsupervised Semantic Scene Completion." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/jevtic2025iccv-feedforward/)

BibTeX

@inproceedings{jevtic2025iccv-feedforward,
  title     = {{Feed-Forward SceneDINO for Unsupervised Semantic Scene Completion}},
  author    = {Jevtić, Aleksandar and Reich, Christoph and Wimbauer, Felix and Hahn, Oliver and Rupprecht, Christian and Roth, Stefan and Cremers, Daniel},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {6784--6796},
  url       = {https://mlanthology.org/iccv/2025/jevtic2025iccv-feedforward/}
}