Self-Improving Vision-Language-Action Models with Data Generation via Residual RL

Xiao, Wenli; Lin, Haotian; Peng, Andy; Xue, Haoru; He, Tairan; Luo, Zhengyi; Xie, Yuqi; Hu, Fengyuan; Fan, Linxi; Shi, Guanya; Zhu, Yuke

ICLR 2026

Abstract

Supervised fine-tuning (SFT) has become the de facto post-training strategy for large vision-language-action (VLA) models, but its reliance on costly human demonstrations limits scalability and generalization. We propose Probe, Learn, Distill (PLD), a plug-and-play framework that improves VLAs through residual reinforcement learning and distribution-aware data collection. In Stage 1 (specialist acquisition), we freeze the VLA backbone and train lightweight residual actors via off-policy RL. These specialists take over in states where the base policy fails, thereby probing failure regions of the generalist. In Stage 2 (data collection), we employ a hybrid rollout scheme that biases residual interventions toward states frequently visited by the base policy, aligning collected trajectories with the generalist’s deployment distribution while capturing recovery behaviors. In Stage 3 (fine-tuning), these curated trajectories are distilled back into the generalist with standard SFT, applicable to both flow-matching and autoregressive heads. We evaluate PLD across diverse settings: it achieves a near-saturated 99% task success rate on the LIBERO benchmark, delivers over 50% performance gains in SimplerEnv, and demonstrates practicality on real-world Franka arm manipulation tasks. We further provide ablations showing that residual policy probing and distribution-aware replay are key to collecting deployment-aligned data that improves VLAs’ capabilities on both seen and unseen tasks. Our results demonstrate that RL-generated, policy-aligned data can surpass teleoperation-only demonstrations, offering a scalable path toward self-improving VLA models.
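
The hybrid rollout idea in Stage 2 can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say how the switch between base policy and residual specialist is chosen, so the random base-only prefix below (`max_probe_steps`) is an assumption, as are the 1-D environment, the function names, and the toy policies. The sketch only shows the general pattern of letting the frozen base policy act first, so that visited states stay close to its own distribution, and then adding a residual correction on top of the base action.

```python
import random
from typing import Callable, List, Tuple

# (state, executed action, whether the residual correction was active)
Transition = Tuple[float, float, bool]


def hybrid_rollout(
    base_policy: Callable[[float], float],
    residual_policy: Callable[[float, float], float],
    step: Callable[[float, float], float],
    start_state: float,
    horizon: int,
    max_probe_steps: int,
    rng: random.Random,
) -> List[Transition]:
    """Run the base policy alone for a random prefix, then add the residual.

    The random prefix length is a hypothetical switching rule chosen for
    illustration; it is not taken from the paper.
    """
    probe_steps = rng.randint(0, max_probe_steps)
    state = start_state
    trajectory: List[Transition] = []
    for t in range(horizon):
        base_action = base_policy(state)
        residual_active = t >= probe_steps
        action = base_action
        if residual_active:
            # The residual actor sees the state and the base action and
            # outputs a correction that is added to the base action.
            action += residual_policy(state, base_action)
        trajectory.append((state, action, residual_active))
        state = step(state, action)
    return trajectory


# Toy 1-D setup (all hypothetical): the goal is 1.0, but the base policy
# steers toward 0.6, standing in for a generalist that fails on this task.
GOAL = 1.0


def toy_base_policy(state: float) -> float:
    return 0.5 * (0.6 - state)


def toy_residual_policy(state: float, base_action: float) -> float:
    # Correction that turns the base action into a step toward the true goal.
    return 0.5 * (GOAL - state) - base_action


def toy_step(state: float, action: float) -> float:
    return state + action


if __name__ == "__main__":
    rollout = hybrid_rollout(
        toy_base_policy, toy_residual_policy, toy_step,
        start_state=0.0, horizon=30, max_probe_steps=5,
        rng=random.Random(0),
    )
    print("last recorded state:", rollout[-1][0])
```

In this toy, the recorded trajectory begins with states reached by the base policy alone and then contains the recovery toward the goal. Trajectories of that shape are what Stage 3 would distill back into the generalist with SFT.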

Cite

Text

Xiao et al. "Self-Improving Vision-Language-Action Models with Data Generation via Residual RL." International Conference on Learning Representations, 2026.

Markdown

[Xiao et al. "Self-Improving Vision-Language-Action Models with Data Generation via Residual RL." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/xiao2026iclr-selfimproving/)

BibTeX

@inproceedings{xiao2026iclr-selfimproving,
  title     = {{Self-Improving Vision-Language-Action Models with Data Generation via Residual RL}},
  author    = {Xiao, Wenli and Lin, Haotian and Peng, Andy and Xue, Haoru and He, Tairan and Luo, Zhengyi and Xie, Yuqi and Hu, Fengyuan and Fan, Linxi and Shi, Guanya and Zhu, Yuke},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/xiao2026iclr-selfimproving/}
}