FASTer: Toward Powerful and Efficient Autoregressive Vision–Language–Action Models with Learnable Action Tokenizer and Block-Wise Decoding

Liu, Yicheng; Zhang, Shiduo; Dong, Zibin; Ye, Baijun; Yuan, Tianyuan; Yu, Xiaopeng; Yin, Linqi; Lu, Chenhao; Shi, Junhao; Yu, Luca Jiang-Tao; Zheng, Liangtao; Gong, Jingjing; Jiang, Tao; Qiu, Xipeng; Zhao, Hang

ICLR 2026

Abstract

Autoregressive vision-language-action (VLA) models have recently demonstrated strong capabilities in robotic manipulation. However, their core process of action tokenization often involves a trade-off between reconstruction fidelity and inference efficiency. We introduce FASTer, a unified framework for efficient and generalizable robot learning that integrates a learnable tokenizer with an autoregressive policy built upon it. FASTerVQ encodes action chunks as single-channel images, capturing global spatio-temporal dependencies while maintaining a high compression ratio. FASTerVLA builds on this tokenizer with block-wise autoregressive decoding and a lightweight action expert, achieving both faster inference and higher task performance. Extensive experiments across simulated and real-world benchmarks show that FASTerVQ delivers superior reconstruction quality, high token utilization, and strong cross-task and cross-embodiment generalization, while FASTerVLA further improves overall capability, surpassing previous state-of-the-art VLA models in both inference speed and task performance.

Cite

Text

Liu et al. "FASTer: Toward Powerful and Efficient Autoregressive Vision–Language–Action Models with Learnable Action Tokenizer and Block-Wise Decoding." International Conference on Learning Representations, 2026.

Markdown

[Liu et al. "FASTer: Toward Powerful and Efficient Autoregressive Vision–Language–Action Models with Learnable Action Tokenizer and Block-Wise Decoding." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/liu2026iclr-faster/)

BibTeX

@inproceedings{liu2026iclr-faster,
  title     = {{FASTer: Toward Powerful and Efficient Autoregressive Vision–Language–Action Models with Learnable Action Tokenizer and Block-Wise Decoding}},
  author    = {Liu, Yicheng and Zhang, Shiduo and Dong, Zibin and Ye, Baijun and Yuan, Tianyuan and Yu, Xiaopeng and Yin, Linqi and Lu, Chenhao and Shi, Junhao and Yu, Luca Jiang-Tao and Zheng, Liangtao and Gong, Jingjing and Jiang, Tao and Qiu, Xipeng and Zhao, Hang},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/liu2026iclr-faster/}
}