MetaSpatial: Reinforcing 3D Spatial Reasoning in VLMs for the Metaverse

Abstract

We present MetaSpatial, the first reinforcement learning (RL) framework for enhancing 3D spatial reasoning in vision-language models (VLMs), enabling real-time 3D scene layout generation without post-processing. MetaSpatial addresses two key challenges: (i) the need for extensive post-processing, which arises because existing VLMs lack the inherent 3D spatial reasoning needed to generate realistic layouts; and (ii) the inefficiency of supervised fine-tuning (SFT) for layout generation, owing to the scarcity of perfect annotations. Our core contribution is the 3D Spatial Policy Optimization (3D-SPO) algorithm, which incorporates physics-aware modulation into object-level advantage estimates and a trajectory-level reward obtained from a training-only multi-turn refinement pipeline. This design improves temporal credit assignment and encourages spatially consistent policy learning. Empirical evaluations across models of varying scales show that MetaSpatial improves spatial coherence, physical plausibility, and formatting stability, yielding more realistic and functionally coherent object placements for metaverse environments.
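
As a rough illustration of what "physics-aware modulation of object-level advantage estimates" combined with a "trajectory-level reward" could look like, the sketch below may help. It is not the 3D-SPO algorithm as published: the abstract gives no formulas, so the group-normalized baseline, the subtractive penalty, and the names `group_advantages`, `modulate_by_physics`, `object_violations`, and `strength` are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_advantages(trajectory_rewards):
    """Normalize trajectory-level rewards within a group of sampled layouts.

    Assumption: a group-relative baseline (mean and standard deviation of
    the group) is used. The abstract does not specify the baseline.
    """
    mu = mean(trajectory_rewards)
    sigma = pstdev(trajectory_rewards)
    # The small constant avoids division by zero when all rewards are equal.
    return [(r - mu) / (sigma + 1e-8) for r in trajectory_rewards]


def modulate_by_physics(advantage, object_violations, strength=0.5):
    """Spread one trajectory advantage over objects, lowering it per violation.

    object_violations: hypothetical scores in [0, 1] per object, where 0 means
    no collision or out-of-bounds placement and 1 means a severe violation.
    Assumption: the modulation is a simple subtractive penalty.
    """
    return [advantage - strength * v for v in object_violations]


# Three sampled layouts for the same room, each with a trajectory-level reward.
advantages = group_advantages([1.0, 2.0, 3.0])

# The best layout has two objects; the second one collides with a wall.
per_object = modulate_by_physics(advantages[2], [0.0, 1.0])
```

In this sketch, every object in a layout inherits the layout's trajectory-level advantage, and objects that violate physical constraints receive a lower value, so the policy update would credit well-placed objects more than badly placed ones within the same layout.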

Cite

Text

Pan and Liu. "MetaSpatial: Reinforcing 3D Spatial Reasoning in VLMs for the Metaverse." International Conference on Learning Representations, 2026.

Markdown

[Pan and Liu. "MetaSpatial: Reinforcing 3D Spatial Reasoning in VLMs for the Metaverse." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/pan2026iclr-metaspatial/)

BibTeX

@inproceedings{pan2026iclr-metaspatial,
  title     = {{MetaSpatial: Reinforcing 3D Spatial Reasoning in VLMs for the Metaverse}},
  author    = {Pan, Zhenyu and Liu, Han},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/pan2026iclr-metaspatial/}
}