OneTwoVLA: A Unified Vision-Language-Action Model with Adaptive Reasoning

Lin, Fanqi; Nai, Ruiqian; Hu, Yingdong; You, Jiacheng; Zhao, Junming; Gao, Yang

ICLR 2026

Abstract

General-purpose robots capable of performing diverse tasks require synergistic reasoning and acting capabilities. However, recent dual-system approaches, which separate high-level reasoning from low-level acting, often suffer from challenges such as limited mutual understanding of capabilities between systems and latency issues. This paper introduces OneTwoVLA, a single unified vision-language-action model that can perform both acting (System One) and reasoning (System Two). Crucially, OneTwoVLA adaptively switches between two modes: explicitly reasoning at critical moments during task execution, and generating actions based on the most recent reasoning at other times. To further unlock OneTwoVLA's reasoning and generalization capabilities, we design a scalable pipeline for synthesizing embodied reasoning-centric vision-language data, used for co-training with robot data. We validate OneTwoVLA's effectiveness through extensive experiments, highlighting its superior performance across four key capabilities: long-horizon task planning, error detection and recovery, natural human-robot interaction, and generalizable visual grounding, enabling the model to perform long-horizon, highly dexterous manipulation tasks such as making hotpot or mixing cocktails.
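The adaptive switching described in the abstract can be illustrated with a toy control loop. This is only a minimal sketch, not the authors' implementation: the class `ToyUnifiedPolicy`, the `is_critical` predicate, and the string-based outputs are all hypothetical stand-ins, and the actual model decides when to reason from its own learned outputs rather than from a hand-written rule.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ToyUnifiedPolicy:
    """Hypothetical stand-in for one model that can either reason or act."""

    # Assumption: a hand-written rule marks "critical moments".
    is_critical: Callable[[str], bool]
    latest_reasoning: str = "no plan yet"
    log: List[str] = field(default_factory=list)

    def step(self, observation: str) -> str:
        if self.is_critical(observation):
            # Reasoning mode: refresh the stored reasoning, emit no action.
            self.latest_reasoning = f"plan for: {observation}"
            output = f"REASON[{self.latest_reasoning}]"
        else:
            # Acting mode: condition the action on the most recent reasoning.
            output = f"ACT[{observation} | {self.latest_reasoning}]"
        self.log.append(output)
        return output


policy = ToyUnifiedPolicy(is_critical=lambda obs: obs.startswith("new"))
observations = ["new task", "frame 1", "frame 2", "new subtask", "frame 3"]
outputs = [policy.step(obs) for obs in observations]
for line in outputs:
    print(line)
```

In this sketch the reasoning text is cached between steps, so acting steps reuse it without regenerating it, which mirrors the abstract's description of acting "based on the most recent reasoning".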

Cite

Text

Lin et al. "OneTwoVLA: A Unified Vision-Language-Action Model with Adaptive Reasoning." International Conference on Learning Representations, 2026.

Markdown

[Lin et al. "OneTwoVLA: A Unified Vision-Language-Action Model with Adaptive Reasoning." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/lin2026iclr-onetwovla/)

BibTeX

@inproceedings{lin2026iclr-onetwovla,
  title     = {{OneTwoVLA: A Unified Vision-Language-Action Model with Adaptive Reasoning}},
  author    = {Lin, Fanqi and Nai, Ruiqian and Hu, Yingdong and You, Jiacheng and Zhao, Junming and Gao, Yang},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/lin2026iclr-onetwovla/}
}