An Embodied Generalist Agent in 3D World
Abstract
Leveraging massive knowledge from large language models (LLMs), recent machine learning models show notable success in general-purpose task solving in diverse domains such as computer vision and robotics. However, several significant challenges remain: (i) most of these models rely on 2D images and exhibit a limited capacity for 3D input; (ii) these models rarely explore tasks inherently defined in the 3D world. We introduce LEO, an embodied multi-modal generalist agent that excels in perceiving, grounding, reasoning, planning, and acting in the 3D world. LEO is trained with a unified task interface, model architecture, and objective in two stages: (i) 3D vision-language (VL) alignment and (ii) 3D vision-language-action (VLA) instruction tuning. We collect large-scale datasets comprising diverse object-level and scene-level tasks, which require considerable understanding of and interaction with the 3D world. Through extensive experiments, we demonstrate LEO's remarkable proficiency across a wide spectrum of tasks, including 3D captioning, question answering, embodied reasoning, navigation, and manipulation.
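To make the "unified task interface" concrete, the sketch below illustrates, under stated assumptions, how object-level 3D context, a task instruction, and a textual or action response could be serialized into a single sequence shared by both training stages, so that one decoder and one next-token objective cover VL alignment and VLA instruction tuning alike. All names here (Sample, serialize, the special tokens) are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical sketch of a unified VL/VLA task interface: every task
# (captioning, QA, planning, navigation, manipulation) is cast as
# "3D context + instruction -> textual response or action tokens".

@dataclass
class Sample:
    scene_objects: List[str]   # placeholders for per-object 3D features (e.g., point-cloud embeddings)
    instruction: str           # task prompt, e.g. "Describe the object." or "Pick up the mug."
    response: str              # caption/answer (stage i and ii) or discretized action tokens (stage ii)

def serialize(sample: Sample) -> str:
    """Flatten a sample into one sequence so a single decoder and a single
    next-token objective cover both 3D VL alignment and 3D VLA instruction tuning."""
    obj_part = " ".join(f"<obj>{o}</obj>" for o in sample.scene_objects)
    return (f"<3d> {obj_part} </3d> "
            f"<inst> {sample.instruction} </inst> "
            f"<resp> {sample.response} </resp>")

if __name__ == "__main__":
    # Stage (i)-style sample: object-level captioning for 3D VL alignment.
    caption = Sample(["chair_feat_017"], "Describe the object.", "A wooden chair with armrests.")
    # Stage (ii)-style sample: embodied acting, with the response given as action tokens.
    action = Sample(["table_feat_003", "mug_feat_011"], "Pick up the mug.", "<act_12> <act_5> <act_31>")
    print(serialize(caption))
    print(serialize(action))
```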
Cite
Text
Huang et al. "An Embodied Generalist Agent in 3D World." ICML 2024 Workshops: MFM-EAI, 2024.
Markdown
[Huang et al. "An Embodied Generalist Agent in 3D World." ICML 2024 Workshops: MFM-EAI, 2024.](https://mlanthology.org/icmlw/2024/huang2024icmlw-embodied/)
BibTeX
@inproceedings{huang2024icmlw-embodied,
  title     = {{An Embodied Generalist Agent in 3D World}},
  author    = {Huang, Jiangyong and Yong, Silong and Ma, Xiaojian and Linghu, Xiongkun and Li, Puhao and Wang, Yan and Li, Qing and Zhu, Song-Chun and Jia, Baoxiong and Huang, Siyuan},
  booktitle = {ICML 2024 Workshops: MFM-EAI},
  year      = {2024},
  url       = {https://mlanthology.org/icmlw/2024/huang2024icmlw-embodied/}
}