VIMA: Robot Manipulation with Multimodal Prompts

Abstract

Prompt-based learning has emerged as a successful paradigm in natural language processing, where a single general-purpose language model can be instructed to perform any task specified by input prompts. Yet task specification in robotics comes in many forms, such as imitating one-shot demonstrations, following language instructions, and reaching visual goals. These forms are often treated as distinct tasks and tackled by specialized models. We show that a wide spectrum of robot manipulation tasks can be expressed with multimodal prompts that interleave textual and visual tokens. Accordingly, we develop a new simulation benchmark consisting of thousands of procedurally generated tabletop tasks with multimodal prompts, 600K+ expert trajectories for imitation learning, and a four-level evaluation protocol for systematic generalization. We design a transformer-based robot agent, VIMA, that processes these prompts and outputs motor actions autoregressively. VIMA features a recipe that achieves strong model scalability and data efficiency. It outperforms alternative designs in the hardest zero-shot generalization setting by up to $2.9\times$ task success rate given the same training data. With $10\times$ less training data, VIMA still performs $2.7\times$ better than the best competing variant. Code and video demos are available at https://vimalabs.github.io
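
To make the multimodal-prompt idea concrete, below is a minimal sketch of how interleaved text and image tokens could be embedded into a single prompt sequence, and how a transformer decoder could cross-attend to that prompt while predicting motor actions autoregressively. This is not the authors' implementation: the embedding width, object-feature dimension, action space, and all module names are illustrative assumptions.

```python
# Minimal sketch (not the VIMA authors' code) of multimodal prompting for
# manipulation. Dimensions, vocab sizes, and the 7-D action space are assumptions.
import torch
import torch.nn as nn

D = 256  # shared embedding width (assumption)

class MultimodalPromptEncoder(nn.Module):
    """Interleaves word embeddings and visual-object embeddings into one prompt sequence."""
    def __init__(self, vocab_size=1000, obj_feat_dim=512):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, D)
        self.obj_proj = nn.Linear(obj_feat_dim, D)  # project object-crop features to D

    def forward(self, segments):
        # segments: list of ("text", LongTensor[n]) or ("image", FloatTensor[m, obj_feat_dim])
        parts = [self.word_emb(x) if kind == "text" else self.obj_proj(x)
                 for kind, x in segments]
        return torch.cat(parts, dim=0).unsqueeze(0)  # [1, T_prompt, D]

class AutoregressivePolicy(nn.Module):
    """Cross-attends the obs/action history to the encoded prompt and predicts the next action."""
    def __init__(self, action_dim=7):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model=D, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.action_head = nn.Linear(D, action_dim)

    def forward(self, history, prompt):
        # history: [1, T_hist, D] embedded observation/action tokens
        mask = nn.Transformer.generate_square_subsequent_mask(history.size(1))
        h = self.decoder(history, prompt, tgt_mask=mask)  # causal self-attn + prompt cross-attn
        return self.action_head(h[:, -1])                 # next motor action

if __name__ == "__main__":
    enc, policy = MultimodalPromptEncoder(), AutoregressivePolicy()
    prompt = enc([("text", torch.tensor([1, 2, 3])),   # e.g. "put the ... into"
                  ("image", torch.randn(2, 512)),      # two object crops
                  ("text", torch.tensor([4]))])
    history = torch.randn(1, 5, D)                     # 5 embedded history tokens
    action = policy(history, prompt)                   # shape [1, 7]
```

One plausible appeal of this design is that cross-attending to a fixed prompt encoding decouples the prompt length from the rollout length, so the same policy interface serves demonstrations, language instructions, and visual goals alike.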

Cite

Text

Jiang et al. "VIMA: Robot Manipulation with Multimodal Prompts." International Conference on Machine Learning, 2023.

Markdown

[Jiang et al. "VIMA: Robot Manipulation with Multimodal Prompts." International Conference on Machine Learning, 2023.](https://mlanthology.org/icml/2023/jiang2023icml-vima/)

BibTeX

@inproceedings{jiang2023icml-vima,
  title     = {{VIMA: Robot Manipulation with Multimodal Prompts}},
  author    = {Jiang, Yunfan and Gupta, Agrim and Zhang, Zichen and Wang, Guanzhi and Dou, Yongqiang and Chen, Yanjun and Fei-Fei, Li and Anandkumar, Anima and Zhu, Yuke and Fan, Linxi},
  booktitle = {International Conference on Machine Learning},
  year      = {2023},
  pages     = {14975--15022},
  volume    = {202},
  url       = {https://mlanthology.org/icml/2023/jiang2023icml-vima/}
}