Any2Policy: Learning Visuomotor Policy with Any-Modality

Abstract

Humans communicate and perceive through different modalities, such as text, sound, and images. For robots to be more generalizable embodied agents, they should be able to follow instructions and perceive the world across diverse modalities. Current robotic learning methods often focus on single-modal task specification and observation, which limits their ability to process rich multi-modal information. To address this limitation, we present an end-to-end general-purpose multi-modal system named Any-to-Policy Embodied Agents. This system enables robots to handle tasks specified and observed through various modalities, whether in combinations such as text-image, audio-image, and text-point cloud, or in isolation. Our approach trains a versatile modality network that adapts to various inputs and connects to policy networks for effective control. Because no existing multi-modal robotics dataset is available for evaluation, we assembled a comprehensive real-world dataset encompassing 30 robotic tasks. Each task in this dataset is richly annotated across multiple modalities, providing a robust foundation for assessment. We conducted extensive validation of our proposed unified-modality embodied agent on several simulation benchmarks, including Franka Kitchen, Meta-World, and ManiSkill2, as well as in our real-world settings. Our experiments showcase the promising capability of building embodied agents that can adapt to diverse multi-modal inputs in a unified framework.

Cite

Text

Zhu et al. "Any2Policy: Learning Visuomotor Policy with Any-Modality." Neural Information Processing Systems, 2024. doi:10.52202/079017-4244

Markdown

[Zhu et al. "Any2Policy: Learning Visuomotor Policy with Any-Modality." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/zhu2024neurips-any2policy/) doi:10.52202/079017-4244

BibTeX

@inproceedings{zhu2024neurips-any2policy,
  title     = {{Any2Policy: Learning Visuomotor Policy with Any-Modality}},
  author    = {Zhu, Yichen and Ou, Zhicai and Feng, Feifei and Tang, Jian},
  booktitle = {Neural Information Processing Systems},
  year      = {2024},
  doi       = {10.52202/079017-4244},
  url       = {https://mlanthology.org/neurips/2024/zhu2024neurips-any2policy/}
}