Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning

Yang, Ganlin; Zhang, Tianyi; Hao, Haoran; Wang, Weiyun; Liu, Yibin; Wang, Dehui; Chen, Guanzhou; Cai, Zijian; Chen, Junting; Su, Weijie; Zhou, Wengang; Qiao, Yu; Dai, Jifeng; Pang, Jiangmiao; Luo, Gen; Wang, Wenhai; Mu, Yao; Hou, Zhi

ICLR 2026

Abstract

While significant research has focused on developing embodied reasoning capabilities using Vision-Language Models (VLMs) or integrating advanced VLMs into Vision-Language-Action (VLA) models for end-to-end robot control, few studies directly address the critical gap between upstream VLM-based reasoning and downstream VLA policy learning. In this work, we take an initial step toward bridging embodied reasoning with VLA policy learning by introducing **Vlaser**, a **V**ision-**L**anguage-**A**ction Model with **s**ynergistic **e**mbodied **r**easoning capability: a foundational vision-language model designed to integrate high-level reasoning with low-level control for embodied agents. Built upon the high-quality Vlaser-6M dataset, Vlaser achieves state-of-the-art performance across a range of embodied reasoning benchmarks, including spatial reasoning, embodied grounding, embodied QA, and task planning. Furthermore, we systematically examine how different VLM initializations affect supervised VLA fine-tuning, offering novel insights into mitigating the domain shift between internet-scale pre-training data and embodied-specific policy learning data. Based on these insights, our approach achieves state-of-the-art results on the WidowX benchmark and competitive performance on the Google Robot benchmark. We will open-source the model weights, data generation pipelines, and the full dataset to support future research.

Cite

Text

Yang et al. "Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning." International Conference on Learning Representations, 2026.

Markdown

[Yang et al. "Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/yang2026iclr-vlaser/)

BibTeX

@inproceedings{yang2026iclr-vlaser,
  title     = {{Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning}},
  author    = {Yang, Ganlin and Zhang, Tianyi and Hao, Haoran and Wang, Weiyun and Liu, Yibin and Wang, Dehui and Chen, Guanzhou and Cai, Zijian and Chen, Junting and Su, Weijie and Zhou, Wengang and Qiao, Yu and Dai, Jifeng and Pang, Jiangmiao and Luo, Gen and Wang, Wenhai and Mu, Yao and Hou, Zhi},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/yang2026iclr-vlaser/}
}