An Embodied Generalist Agent in 3D World

Abstract

Leveraging massive knowledge from large language models (LLMs), recent machine learning models show notable successes in general-purpose task solving in diverse domains such as computer vision and robotics. However, several significant challenges remain: (i) most of these models rely on 2D images yet exhibit a limited capacity for 3D input; (ii) these models rarely explore the tasks inherently defined in the 3D world. We introduce LEO, an embodied multi-modal generalist agent that excels in perceiving, grounding, reasoning, planning, and acting in the 3D world. LEO is trained with a unified task interface, model architecture, and objective in two stages: (i) 3D vision-language (VL) alignment and (ii) 3D vision-language-action (VLA) instruction tuning. We collect large-scale datasets comprising diverse object-level and scene-level tasks, which require considerable understanding of and interaction with the 3D world. Through extensive experiments, we demonstrate LEO’s remarkable proficiency across a wide spectrum of tasks, including 3D captioning, question answering, embodied reasoning, navigation, and manipulation.

Cite

Text

Huang et al. "An Embodied Generalist Agent in 3D World." ICML 2024 Workshops: MFM-EAI, 2024.

Markdown

[Huang et al. "An Embodied Generalist Agent in 3D World." ICML 2024 Workshops: MFM-EAI, 2024.](https://mlanthology.org/icmlw/2024/huang2024icmlw-embodied/)

BibTeX

@inproceedings{huang2024icmlw-embodied,
  title     = {{An Embodied Generalist Agent in 3D World}},
  author    = {Huang, Jiangyong and Yong, Silong and Ma, Xiaojian and Linghu, Xiongkun and Li, Puhao and Wang, Yan and Li, Qing and Zhu, Song-Chun and Jia, Baoxiong and Huang, Siyuan},
  booktitle = {ICML 2024 Workshops: MFM-EAI},
  year      = {2024},
  url       = {https://mlanthology.org/icmlw/2024/huang2024icmlw-embodied/}
}