LLaRA: Supercharging Robot Learning Data for Vision-Language Policy

Li, Xiang; Mata, Cristina; Park, Jongwoo; Kahatapitiya, Kumara; Jang, Yoo Sung; Shang, Jinghuan; Ranasinghe, Kanchana; Burgert, Ryan D; Cai, Mu; Lee, Yong Jae; Ryoo, Michael S

LLaRA: Supercharging Robot Learning Data for Vision-Language Policy

Xiang Li, Cristina Mata, Jongwoo Park, Kumara Kahatapitiya, Yoo Sung Jang, Jinghuan Shang, Kanchana Ranasinghe, Ryan D Burgert, Mu Cai, Yong Jae Lee, Michael S Ryoo

ICLR 2025

/iclr/2025/li2025iclr-llara/

Abstract

Vision Language Models (VLMs) have recently been leveraged to generate robotic actions, forming Vision-Language-Action (VLA) models. However, directly adapting a pretrained VLM for robotic control remains challenging, particularly when constrained by a limited number of robot demonstrations. In this work, we introduce LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as visuo-textual conversations and enables an efficient transfer of a pretrained VLM into a powerful VLA, motivated by the success of visual instruction tuning in Computer Vision. First, we present an automated pipeline to generate conversation-style instruction tuning data for robots from existing behavior cloning datasets, aligning robotic actions with image pixel coordinates. Further, we enhance this dataset in a self-supervised manner by defining six auxiliary tasks, without requiring any additional action annotations. We show that a VLM finetuned with a limited amount of such datasets can produce meaningful action decisions for robotic control. Through experiments across multiple simulated and real-world tasks, we demonstrate that LLaRA achieves state-of-the-art performance while preserving the generalization capabilities of large language models. The code, datasets, and pretrained models are available at https://github.com/LostXine/LLaRA.

PDF ICLR Semantic Scholar

Cite

Text

Li et al. "LLaRA: Supercharging Robot Learning Data for Vision-Language Policy." International Conference on Learning Representations, 2025.

Markdown

[Li et al. "LLaRA: Supercharging Robot Learning Data for Vision-Language Policy." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/li2025iclr-llara/)

BibTeX

@inproceedings{li2025iclr-llara,
  title     = {{LLaRA: Supercharging Robot Learning Data for Vision-Language Policy}},
  author    = {Li, Xiang and Mata, Cristina and Park, Jongwoo and Kahatapitiya, Kumara and Jang, Yoo Sung and Shang, Jinghuan and Ranasinghe, Kanchana and Burgert, Ryan D and Cai, Mu and Lee, Yong Jae and Ryoo, Michael S},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/li2025iclr-llara/}
}