Manipulate-Anything: Automating Real-World Robots Using Vision-Language Models

Abstract

Large-scale endeavors like RT-1 and widespread community efforts such as Open-X-Embodiment have contributed to growing the scale of robot demonstration data. However, there is still an opportunity to improve the quality, quantity, and diversity of robot demonstration data. Although vision-language models have been shown to automatically generate demonstration data, their utility has been limited: they operate only in environments with privileged state information, they require hand-designed skills, and they interact with only a few object instances. We propose Manipulate-Anything, a scalable automated demonstration-generation method for real-world robotic manipulation. Unlike prior work, our method operates in real-world environments without any privileged state information or hand-designed skills, and can manipulate any static object. We evaluate our method in two setups. First, Manipulate-Anything successfully generates trajectories for all 5 real-world and 12 simulation tasks, significantly outperforming existing methods like VoxPoser. Second, Manipulate-Anything’s demonstrations train more robust behavior cloning policies than human demonstrations or data generated by VoxPoser and Code-As-Policies. We believe Manipulate-Anything can be a scalable method both for generating data for robotics and for solving novel tasks in a zero-shot setting. Anonymous project page: manipulate-anything.github.io.
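To make the abstract's pipeline concrete, below is a minimal, hypothetical sketch of a VLM-driven demonstration-generation loop in the spirit described above: a VLM proposes the next action from raw observations (no privileged state, no hand-designed skills), and VLM-based verification filters out failed rollouts. All names here (`vlm.propose_action`, `vlm.verify_progress`, `env.step`, etc.) are illustrative assumptions, not the authors' actual API.

```python
# Hypothetical sketch of a VLM-driven demonstration-generation loop.
# Names are assumptions for illustration, not the paper's implementation.

from dataclasses import dataclass, field


@dataclass
class Demonstration:
    task: str
    steps: list = field(default_factory=list)  # (observation, action) pairs


def generate_demonstration(task, env, vlm, max_steps=20):
    """Roll out one demonstration using only visual observations.

    No privileged simulator state or hand-designed skills are assumed:
    the VLM proposes the next end-effector action from the current
    observation, and separate VLM calls verify progress and completion.
    """
    demo = Demonstration(task=task)
    obs = env.reset()  # e.g., multi-view RGB-D images
    for _ in range(max_steps):
        # Ask the VLM for the next action given the task and observation.
        action = vlm.propose_action(task=task, observation=obs)
        next_obs = env.step(action)
        demo.steps.append((obs, action))
        # VLM-based verification: discard rollouts that go off track.
        if not vlm.verify_progress(task=task, observation=next_obs):
            return None  # filter out failed trajectories
        if vlm.is_task_complete(task=task, observation=next_obs):
            return demo  # keep only verified, complete demonstrations
        obs = next_obs
    return None
```

Under this sketch, the retained demonstrations would then serve as training data for a downstream behavior cloning policy, matching the abstract's second evaluation setup.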

Cite

Text

Duan et al. "Manipulate-Anything: Automating Real-World Robots Using Vision-Language Models." Proceedings of The 8th Conference on Robot Learning, 2024.

Markdown

[Duan et al. "Manipulate-Anything: Automating Real-World Robots Using Vision-Language Models." Proceedings of The 8th Conference on Robot Learning, 2024.](https://mlanthology.org/corl/2024/duan2024corl-manipulateanything/)

BibTeX

@inproceedings{duan2024corl-manipulateanything,
  title     = {{Manipulate-Anything: Automating Real-World Robots Using Vision-Language Models}},
  author    = {Duan, Jiafei and Yuan, Wentao and Pumacay, Wilbert and Wang, Yi Ru and Ehsani, Kiana and Fox, Dieter and Krishna, Ranjay},
  booktitle = {Proceedings of The 8th Conference on Robot Learning},
  year      = {2024},
  pages     = {5326--5350},
  volume    = {270},
  url       = {https://mlanthology.org/corl/2024/duan2024corl-manipulateanything/}
}