Align-Then-stEer: Adapting the Vision-Language Action Models Through Unified Latent Guidance
Abstract
Vision-Language-Action (VLA) models pre-trained on large, diverse datasets show remarkable potential for general-purpose robotic manipulation. However, a primary bottleneck remains in adapting these models to downstream tasks, especially when the robot's embodiment or the task itself differs from the pre-training data. This discrepancy leads to a significant mismatch in action distributions, demanding extensive data and compute for effective fine-tuning. To address this challenge, we introduce Align-Then-stEer (ATE), a novel, data-efficient, and plug-and-play adaptation framework. ATE first aligns disparate action spaces by constructing a unified latent space, where a variational autoencoder constrained by reverse KL divergence embeds adaptation actions into modes of the pre-training action latent distribution. Subsequently, it steers the diffusion- or flow-based VLA's generation process during fine-tuning via a guidance mechanism that pushes the model's output distribution towards the target domain. We conduct extensive experiments on cross-embodiment and cross-task manipulation in both simulation and the real world. Compared to direct fine-tuning of representative VLAs, our method improves the average multi-task success rate by up to 9.8% in simulation and achieves a striking 32% success rate gain in a real-world cross-embodiment setting. Our work presents a general and lightweight solution that greatly enhances the practicality of deploying VLA models to new robotic platforms and tasks. Our code is released at https://github.com/TeleHuman/Align-Then-Steer.
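As a rough illustration of why a reverse KL constraint tends to place an embedding inside a mode of a reference distribution, the sketch below evaluates the closed-form KL divergence between two diagonal Gaussians in both directions. It is not the ATE implementation: the function name, the 2-D latents, and the use of a single Gaussian for each distribution are assumptions made for illustration, and treating the adaptation-action posterior as the first argument of the reverse KL is likewise an assumption.

```python
import math

def kl_diag_gaussians(mu_a, var_a, mu_b, var_b):
    """KL(N(mu_a, var_a) || N(mu_b, var_b)) for diagonal Gaussians, in nats."""
    total = 0.0
    for ma, va, mb, vb in zip(mu_a, var_a, mu_b, var_b):
        total += 0.5 * (math.log(vb / va) + (va + (ma - mb) ** 2) / vb - 1.0)
    return total

# Hypothetical latents: q is a narrow posterior for adaptation actions,
# p is a broad stand-in for the pre-training latent distribution.
q_mu, q_var = [0.5, -0.2], [0.1, 0.1]
p_mu, p_var = [0.0, 0.0], [1.0, 1.0]

# Reverse KL (expectation under q) only penalizes q for placing mass where p is small.
reverse_kl = kl_diag_gaussians(q_mu, q_var, p_mu, p_var)
# Forward KL (expectation under p) penalizes q for failing to cover all of p.
forward_kl = kl_diag_gaussians(p_mu, p_var, q_mu, q_var)

print(f"reverse KL(q||p) = {reverse_kl:.4f}")
print(f"forward KL(p||q) = {forward_kl:.4f}")
```

With these made-up numbers the narrow q incurs a much smaller reverse KL than forward KL, which matches the usual description of reverse KL as mode-seeking: a compact posterior that sits inside a high-density region of the reference is penalized only lightly.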
Cite
Text
Zhang et al. "Align-Then-stEer: Adapting the Vision-Language Action Models Through Unified Latent Guidance." International Conference on Learning Representations, 2026.
Markdown
[Zhang et al. "Align-Then-stEer: Adapting the Vision-Language Action Models Through Unified Latent Guidance." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/zhang2026iclr-alignthensteer/)
BibTeX
@inproceedings{zhang2026iclr-alignthensteer,
title = {{Align-Then-stEer: Adapting the Vision-Language Action Models Through Unified Latent Guidance}},
author = {Zhang, Yang and Wang, Chenwei and Lu, Ouyang and Zhao, Yuan and Ge, Yunfei and Sun, Zhenglong and Li, Xiu and Zhang, Chi and Bai, Chenjia and Li, Xuelong},
booktitle = {International Conference on Learning Representations},
year = {2026},
url = {https://mlanthology.org/iclr/2026/zhang2026iclr-alignthensteer/}
}