Spatially Guided Training for Vision-Language-Action Model

Abstract

Large vision–language models (VLMs) excel at multimodal understanding but fall short when extended to embodied tasks, where instructions must be transformed into low-level motor actions. We introduce SP-VLA, a dual-system **V**ision–**L**anguage–**A**ction framework that leverages **S**patial **P**riors as a bridge between linguistic instructions and embodiment-specific control. SP-VLA aligns action learning with spatial priors through two stages: (i) spatial grounding pre-training, which equips the VLM with transferable priors via scalable point, box, and trajectory prediction from both web-scale and robot-specific data, and (ii) spatially guided action post-training, which encourages the model to produce richer spatial priors to guide action generation via spatial prompting. This design preserves spatial grounding during policy learning and promotes consistent optimization across spatial and action objectives. Empirically, SP-VLA achieves substantial improvements over vanilla VLA, raising performance from $66.1$ to $84.6$ on Google Robot and from $54.7$ to $73.2$ on WidowX, establishing new state-of-the-art results on SimplerEnv. It also demonstrates stronger generalization to unseen objects and paraphrased instructions, as well as robustness to long-horizon perturbations in real-world settings. These results highlight scalable spatially guided training as a promising direction for robust, generalizable robot learning. We will release code, data, and model checkpoints to support future research. More visualization results are available at the anonymous page: https://sp-vla-anonymous.vercel.app

Cite

Text

Ye et al. "Spatially Guided Training for Vision-Language-Action Model." International Conference on Learning Representations, 2026.

Markdown

[Ye et al. "Spatially Guided Training for Vision-Language-Action Model." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/ye2026iclr-spatially/)

BibTeX

@inproceedings{ye2026iclr-spatially,
  title     = {{Spatially Guided Training for Vision-Language-Action Model}},
  author    = {Ye, Jinhui and Wang, Fangjing and Gao, Ning and Yu, Junqiu and Yangkun, Zhu and Wang, Bin and Zhang, Jinyu and Jin, Weiyang and Fu, Yanwei and Zheng, Feng and Chen, Yilun and Pang, Jiangmiao},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/ye2026iclr-spatially/}
}