From Thousands to Billions: 3D Visual Language Grounding via Render-Supervised Distillation from 2D VLMs

Cao, Ang; Arnaud, Sergio; Maksymets, Oleksandr; Yang, Jianing; Jain, Ayush; Martin, Ada; Berges, Vincent-Pierre; Mcvay, Paul; Partsey, Ruslan; Rajeswaran, Aravind; Meier, Franziska; Johnson, Justin; Park, Jeong Joon; Sax, Alexander

ICML 2025 pp. 6505-6521

Abstract

3D vision-language grounding faces a fundamental data bottleneck: while 2D models train on billions of images, 3D models have access to only thousands of labeled scenes, a six-order-of-magnitude gap that severely limits performance. We introduce LIFT-GS, a practical distillation technique that overcomes this limitation by using differentiable rendering to bridge 3D and 2D supervision. LIFT-GS predicts 3D Gaussian representations from point clouds and uses them to render predicted language-conditioned 3D masks into 2D views, enabling supervision from 2D foundation models (SAM, CLIP, LLaMA) without requiring any 3D annotations. This render-supervised formulation enables end-to-end training of complete encoder-decoder architectures and is inherently model-agnostic. LIFT-GS achieves state-of-the-art results with 25.7% mAP on open-vocabulary instance segmentation (vs. 20.2% prior SOTA) and consistent 10-30% improvements on referential grounding tasks. Remarkably, pretraining effectively multiplies fine-tuning datasets by 2$\times$, demonstrating strong scaling properties that suggest 3D VLG currently operates in a severely data-scarce regime. Project page: https://liftgs.github.io.
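
To make the render-supervised idea in the abstract concrete, here is a toy sketch in pure Python. It is not the authors' implementation: the isotropic splats, the pinhole camera, the function names (`project`, `render_mask`, `bce_loss`, `scene`) and the hand-made 2D target mask (standing in for a pseudo-label from a 2D model such as SAM) are all assumptions made for illustration. It only shows how per-Gaussian 3D mask predictions can be alpha-composited into a 2D view and scored against a 2D mask, so that no 3D annotation is needed.

```python
import math

def project(center, focal, cx, cy):
    """Pinhole projection of a camera-frame 3D point to pixel coordinates."""
    x, y, z = center
    return focal * x / z + cx, focal * y / z + cy

def render_mask(gaussians, width, height, focal=4.0, sigma=1.0):
    """Alpha-composite per-Gaussian mask probabilities into a 2D image.

    Assumption: each Gaussian is an isotropic 2D splat of fixed pixel
    width `sigma`; a real renderer would project full 3D covariances.
    """
    cx, cy = (width - 1) / 2, (height - 1) / 2
    # Sort nearest-first so compositing runs front to back.
    ordered = sorted(gaussians, key=lambda g: g["center"][2])
    splats = [(project(g["center"], focal, cx, cy), g) for g in ordered]
    image = [[0.0] * width for _ in range(height)]
    for v in range(height):
        for u in range(width):
            transmittance = 1.0
            for (pu, pv), g in splats:
                d2 = (u - pu) ** 2 + (v - pv) ** 2
                alpha = g["opacity"] * math.exp(-d2 / (2 * sigma ** 2))
                image[v][u] += transmittance * alpha * g["mask_prob"]
                transmittance *= 1.0 - alpha
    return image

def bce_loss(rendered, target, eps=1e-6):
    """Mean binary cross-entropy between a rendered mask and a 2D target."""
    total, count = 0.0, 0
    for r_row, t_row in zip(rendered, target):
        for r, t in zip(r_row, t_row):
            r = min(max(r, eps), 1.0 - eps)  # clamp to keep log finite
            total -= t * math.log(r) + (1.0 - t) * math.log(1.0 - r)
            count += 1
    return total / count

def scene(mask_prob):
    """Hypothetical one-Gaussian scene with a given predicted mask probability."""
    return [{"center": (0.0, 0.0, 2.0), "opacity": 0.9, "mask_prob": mask_prob}]

# Hypothetical 2D pseudo-mask: a small blob around the image center.
target = [[1.0 if (u - 2) ** 2 + (v - 2) ** 2 <= 1 else 0.0 for u in range(5)]
          for v in range(5)]

loss_good = bce_loss(render_mask(scene(0.9), 5, 5), target)
loss_bad = bce_loss(render_mask(scene(0.1), 5, 5), target)
print(loss_good, loss_bad)
```

In this toy setup, the scene whose Gaussian is predicted to belong to the mask (`scene(0.9)`) should score a lower loss than the one predicted not to (`scene(0.1)`). In an actual training pipeline, a differentiable renderer and automatic differentiation would presumably carry this 2D loss back to the 3D predictions; the sketch only evaluates the loss and computes no gradients.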

Cite

Text

Cao et al. "From Thousands to Billions: 3D Visual Language Grounding via Render-Supervised Distillation from 2D VLMs." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown

[Cao et al. "From Thousands to Billions: 3D Visual Language Grounding via Render-Supervised Distillation from 2D VLMs." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/cao2025icml-thousands/)

BibTeX

@inproceedings{cao2025icml-thousands,
  title     = {{From Thousands to Billions: 3D Visual Language Grounding via Render-Supervised Distillation from 2D VLMs}},
  author    = {Cao, Ang and Arnaud, Sergio and Maksymets, Oleksandr and Yang, Jianing and Jain, Ayush and Martin, Ada and Berges, Vincent-Pierre and Mcvay, Paul and Partsey, Ruslan and Rajeswaran, Aravind and Meier, Franziska and Johnson, Justin and Park, Jeong Joon and Sax, Alexander},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
  pages     = {6505--6521},
  volume    = {267},
  url       = {https://mlanthology.org/icml/2025/cao2025icml-thousands/}
}