X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model

Zheng, Jinliang; Li, Jianxiong; Wang, Zhihao; Liu, Dongxiu; Kang, Xirui; Feng, Yuchun; Zheng, Yinan; Zou, Jiayin; Chen, Yilun; Zeng, Jia; Wang, Tai; Zhang, Ya-Qin; Liu, Jingjing; Zhan, Xianyuan

ICLR 2026

Abstract

Successful generalist Vision-Language-Action (VLA) models rely on effective training across diverse robotic platforms with large-scale, cross-embodiment, heterogeneous datasets. To facilitate and leverage the heterogeneity in rich, diverse robotic data sources, we propose a novel Soft Prompt approach with minimally added parameters, by infusing prompt learning concepts into cross-embodiment robot learning and introducing separate sets of learnable embeddings for each distinct data source. These embeddings serve as embodiment-specific prompts, which together empower VLA models to effectively exploit varying cross-embodiment features. Our new X-VLA, a neat flow-matching-based VLA architecture, relies exclusively on soft-prompted standard Transformer encoders with an enhanced encoding pipeline, enjoying both scalability and simplicity. Evaluated across 6 simulation environments as well as 3 real-world robotics platforms, our 0.9B instantiation, X-VLA-0.9B, simultaneously achieves state-of-the-art performance over a sweep of benchmark suites, demonstrating superior results on a wide range of capabilities, from flexible dexterity to quick adaptation across embodiments, environments, and tasks.
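
The core idea in the abstract, a separate set of learnable embeddings per data source used as an embodiment-specific prompt, can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `SoftPromptBank`, the source names, the prompt length, the embedding width, and the choice to prepend the prompt to the token sequence are all assumptions made for illustration, and a real model would train these parameters by gradient descent rather than leave them at their random initial values.

```python
import random


class SoftPromptBank:
    """Hypothetical container: one learnable prompt (a list of vectors) per data source."""

    def __init__(self, sources, prompt_len, dim, seed=0):
        rng = random.Random(seed)
        self.prompt_len = prompt_len
        self.dim = dim
        # Each data source gets its own prompt_len x dim table of parameters.
        self.prompts = {
            name: [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(prompt_len)]
            for name in sources
        }

    def prepend(self, source, tokens):
        """Return the source's prompt vectors followed by the input token vectors."""
        if any(len(t) != self.dim for t in tokens):
            raise ValueError("token width must match prompt width")
        return self.prompts[source] + list(tokens)

    def num_parameters(self):
        """Count of added parameters: sources x prompt length x width."""
        return len(self.prompts) * self.prompt_len * self.dim


# Assumed toy sizes: two data sources, 4 prompt vectors each, width 8.
bank = SoftPromptBank(["robot_a", "robot_b"], prompt_len=4, dim=8)
tokens = [[0.0] * 8 for _ in range(10)]  # stand-in for the model's input tokens
sequence = bank.prepend("robot_a", tokens)
```

In this sketch the shared Transformer would consume `sequence` unchanged, so the only per-embodiment parameters are the prompt tables, which grow linearly with the number of data sources.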

Cite

Text

Zheng et al. "X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model." International Conference on Learning Representations, 2026.

Markdown

[Zheng et al. "X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/zheng2026iclr-xvla/)

BibTeX

@inproceedings{zheng2026iclr-xvla,
  title     = {{X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model}},
  author    = {Zheng, Jinliang and Li, Jianxiong and Wang, Zhihao and Liu, Dongxiu and Kang, Xirui and Feng, Yuchun and Zheng, Yinan and Zou, Jiayin and Chen, Yilun and Zeng, Jia and Wang, Tai and Zhang, Ya-Qin and Liu, Jingjing and Zhan, Xianyuan},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/zheng2026iclr-xvla/}
}