UP-VLA: A Unified Understanding and Prediction Model for Embodied Agent

Abstract

Recent Vision-Language-Action (VLA) models leverage pre-trained Vision-Language Models (VLMs) to improve generalization. VLMs, typically pre-trained on vision-language understanding tasks, provide rich semantic knowledge and reasoning abilities. However, prior research has shown that VLMs often focus on high-level semantic content and neglect low-level features, which limits their ability to capture detailed spatial information and understand physical dynamics. These aspects, which are crucial for embodied control tasks, remain underexplored in existing pre-training paradigms. In this paper, we investigate the training paradigm for VLAs and introduce UP-VLA, a Unified VLA model trained with both multi-modal Understanding and future Prediction objectives, which enhances both high-level semantic comprehension and low-level spatial understanding. Experimental results show that UP-VLA achieves a 33% improvement on the Calvin ABC-D benchmark compared to the previous state-of-the-art method. Additionally, UP-VLA demonstrates improved success rates in real-world manipulation tasks, particularly those requiring precise spatial information.

Cite

Text

Zhang et al. "UP-VLA: A Unified Understanding and Prediction Model for Embodied Agent." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown

[Zhang et al. "UP-VLA: A Unified Understanding and Prediction Model for Embodied Agent." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/zhang2025icml-upvla/)

BibTeX

@inproceedings{zhang2025icml-upvla,
  title     = {{UP-VLA: A Unified Understanding and Prediction Model for Embodied Agent}},
  author    = {Zhang, Jianke and Guo, Yanjiang and Hu, Yucheng and Chen, Xiaoyu and Zhu, Xiang and Chen, Jianyu},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
  pages     = {74911--74922},
  volume    = {267},
  url       = {https://mlanthology.org/icml/2025/zhang2025icml-upvla/}
}