Align-Then-stEer: Adapting the Vision-Language Action Models Through Unified Latent Guidance

Zhang, Yang; Wang, Chenwei; Lu, Ouyang; Zhao, Yuan; Ge, Yunfei; Sun, Zhenglong; Li, Xiu; Zhang, Chi; Bai, Chenjia; Li, Xuelong

ICLR 2026

Abstract

Vision-Language-Action (VLA) models pre-trained on large, diverse datasets show remarkable potential for general-purpose robotic manipulation. However, a primary bottleneck remains in adapting these models to downstream tasks, especially when the robot's embodiment or the task itself differs from the pre-training data. This discrepancy leads to a significant mismatch in action distributions, demanding extensive data and compute for effective fine-tuning. To address this challenge, we introduce Align-Then-stEer (ATE), a novel, data-efficient, and plug-and-play adaptation framework. ATE first aligns disparate action spaces by constructing a unified latent space, where a variational autoencoder constrained by reverse KL divergence embeds adaptation actions into modes of the pre-training action latent distribution. Subsequently, it steers the diffusion- or flow-based VLA's generation process during fine-tuning via a guidance mechanism that pushes the model's output distribution towards the target domain. We conduct extensive experiments on cross-embodiment and cross-task manipulation in both simulation and the real world. Compared to direct fine-tuning of representative VLAs, our method improves the average multi-task success rate by up to 9.8% in simulation and achieves a 32% success rate gain in a real-world cross-embodiment setting. Our work presents a general and lightweight solution that greatly enhances the practicality of deploying VLA models to new robotic platforms and tasks. Our code is released at https://github.com/TeleHuman/Align-Then-Steer.
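
The abstract says the reverse KL constraint embeds adaptation actions into modes of the pre-training latent distribution. The toy example below illustrates why reverse KL tends to be mode-seeking. It is a minimal sketch and not the ATE implementation: the one-dimensional bimodal density `p`, the two candidate Gaussians `q_mode` and `q_wide`, and all function names are assumptions chosen for illustration only.

```python
import numpy as np


def log_normal(x, mean, std):
    """Log-density of a 1-D Gaussian evaluated on the grid x."""
    return -0.5 * ((x - mean) / std) ** 2 - np.log(std * np.sqrt(2.0 * np.pi))


def kl_on_grid(log_a, log_b, dx):
    """Riemann-sum estimate of KL(a || b) = E_a[log a - log b]."""
    a = np.exp(log_a)
    return float(np.sum(a * (log_a - log_b)) * dx)


# Integration grid (hypothetical toy setup, not taken from the paper).
x = np.linspace(-12.0, 12.0, 24001)
dx = x[1] - x[0]

# Stand-in for a multi-modal "pre-training" density: two equal modes at -3 and +3.
log_p = np.logaddexp(log_normal(x, -3.0, 0.5), log_normal(x, 3.0, 0.5)) + np.log(0.5)

# Candidate 1 sits on a single mode of p.
log_q_mode = log_normal(x, 3.0, 0.5)
# Candidate 2 is moment-matched to p and spreads mass across both modes.
log_q_wide = log_normal(x, 0.0, np.sqrt(9.25))

# Reverse KL, KL(q || p): penalises q for placing mass where p is small.
reverse_mode = kl_on_grid(log_q_mode, log_p, dx)
reverse_wide = kl_on_grid(log_q_wide, log_p, dx)

# Forward KL, KL(p || q): penalises q for missing mass where p is large.
forward_mode = kl_on_grid(log_p, log_q_mode, dx)
forward_wide = kl_on_grid(log_p, log_q_wide, dx)

print(f"reverse KL: on-mode={reverse_mode:.3f}, wide={reverse_wide:.3f}")
print(f"forward KL: on-mode={forward_mode:.3f}, wide={forward_wide:.3f}")
```

In this toy setting the reverse KL is lower for the candidate that sits on one mode, while the forward KL is lower for the wide, mode-covering candidate. This is the general property that makes a reverse KL constraint a natural fit for placing new samples inside existing modes rather than between them.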

Cite

Text

Zhang et al. "Align-Then-stEer: Adapting the Vision-Language Action Models Through Unified Latent Guidance." International Conference on Learning Representations, 2026.

Markdown

[Zhang et al. "Align-Then-stEer: Adapting the Vision-Language Action Models Through Unified Latent Guidance." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/zhang2026iclr-alignthensteer/)

BibTeX

@inproceedings{zhang2026iclr-alignthensteer,
  title     = {{Align-Then-stEer: Adapting the Vision-Language Action Models Through Unified Latent Guidance}},
  author    = {Zhang, Yang and Wang, Chenwei and Lu, Ouyang and Zhao, Yuan and Ge, Yunfei and Sun, Zhenglong and Li, Xiu and Zhang, Chi and Bai, Chenjia and Li, Xuelong},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/zhang2026iclr-alignthensteer/}
}