Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations

Abstract

Humans learn abstract concepts through multisensory synergy, and once formed, such representations can often be recalled from a single modality. Inspired by this principle, we introduce Concerto, a minimalist simulation of human concept learning for spatial cognition, combining 3D intra-modal self-distillation with 2D-3D cross-modal joint embedding. Despite its simplicity, Concerto learns more coherent and informative spatial features, as demonstrated by zero-shot visualizations. In linear probing for 3D scene perception, it outperforms standalone SOTA 2D and 3D self-supervised models by 14.2% and 4.8%, respectively, and also outperforms their feature concatenation. With full fine-tuning, Concerto sets new SOTA results across multiple scene understanding benchmarks (e.g., 80.7% mIoU on ScanNet). We further present a variant of Concerto tailored for video-lifted point cloud spatial understanding, and a translator that linearly projects Concerto representations into CLIP’s language space, enabling open-world perception. These results highlight that Concerto gives rise to spatial representations with superior fine-grained geometric and semantic consistency.
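
The abstract describes the objective only at a high level: a 3D intra-modal self-distillation term combined with a 2D-3D cross-modal joint-embedding term. The following is a minimal sketch of how two such terms might be combined, not the authors' formulation. It assumes a teacher-student cross-entropy for the self-distillation term and a cosine-distance term between paired point and pixel features; every function name, the temperatures, and the `cross_weight` parameter are hypothetical.

```python
import math


def softmax(logits, temperature):
    """Numerically stable softmax with a temperature (hypothetical helper)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def self_distillation_loss(student_logits, teacher_logits,
                           t_student=0.1, t_teacher=0.05):
    """Cross-entropy of the student distribution against the teacher's.

    Assumption: a generic teacher-student form; the abstract does not
    specify the actual loss or temperatures.
    """
    p_teacher = softmax(teacher_logits, t_teacher)
    p_student = softmax(student_logits, t_student)
    # Clamp to avoid log(0) when a probability underflows.
    return -sum(pt * math.log(max(ps, 1e-12))
                for pt, ps in zip(p_teacher, p_student))


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cross_modal_loss(point_feats, pixel_feats):
    """Mean cosine distance over paired 3D point and 2D pixel features.

    Assumption: features are already paired, i.e. point_feats[i]
    corresponds to pixel_feats[i].
    """
    sims = [cosine_similarity(p, q) for p, q in zip(point_feats, pixel_feats)]
    return 1.0 - sum(sims) / len(sims)


def joint_loss(student_logits, teacher_logits, point_feats, pixel_feats,
               cross_weight=1.0):
    """Weighted sum of the two terms; cross_weight is a hypothetical knob."""
    intra = self_distillation_loss(student_logits, teacher_logits)
    cross = cross_modal_loss(point_feats, pixel_feats)
    return intra + cross_weight * cross


if __name__ == "__main__":
    loss = joint_loss(
        student_logits=[2.0, 0.0],
        teacher_logits=[2.0, 0.0],
        point_feats=[[1.0, 0.0], [0.0, 1.0]],
        pixel_feats=[[1.0, 0.1], [0.1, 1.0]],
    )
    print(f"joint loss: {loss:.4f}")
```

In this sketch both terms shrink as the student agrees with the teacher and as paired 3D and 2D features align, which is the general behaviour a joint objective of this kind would be expected to have.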

Cite

Text

Zhang et al. "Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations." Advances in Neural Information Processing Systems, 2025.

Markdown

[Zhang et al. "Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/zhang2025neurips-concerto/)

BibTeX

@inproceedings{zhang2025neurips-concerto,
  title     = {{Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations}},
  author    = {Zhang, Yujia and Wu, Xiaoyang and Lao, Yixing and Wang, Chengyao and Tian, Zhuotao and Wang, Naiyan and Zhao, Hengshuang},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/zhang2025neurips-concerto/}
}