ShapeLLM: Universal 3D Object Understanding for Embodied Interaction

Abstract

This paper presents ShapeLLM, the first 3D Multimodal Large Language Model (LLM) designed for embodied interaction, exploring universal 3D object understanding with 3D point clouds and languages. ShapeLLM is built upon an improved 3D encoder obtained by extending ReCon to ReCon++, which benefits from multi-view image distillation for enhanced geometry understanding. By utilizing ReCon++ as the 3D point cloud input encoder for LLMs, ShapeLLM is trained on constructed instruction-following data and tested on our newly human-curated benchmark, 3D MM-Vet. ReCon++ and ShapeLLM achieve state-of-the-art performance in 3D geometry understanding and language-unified 3D interaction tasks, such as embodied visual grounding.
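The abstract describes a pipeline in which a 3D point cloud encoder (ReCon++) supplies input tokens to an LLM. The following is a minimal sketch of that idea, not the authors' implementation: the encoder stub, the projector design, the module names, and all dimensions below are illustrative assumptions intended only to show how encoder features could be mapped into an LLM's embedding space.

# Minimal sketch (assumed design, not the paper's code): a stand-in 3D encoder
# produces feature tokens from a point cloud, and a small projector maps them
# to the LLM hidden size so they can be prepended to text embeddings.
import torch
import torch.nn as nn


class PointCloudEncoderStub(nn.Module):
    """Placeholder for a ReCon++-style 3D encoder: maps N xyz points
    to a short sequence of feature tokens (hypothetical, for illustration)."""

    def __init__(self, num_tokens: int = 64, dim: int = 384):
        super().__init__()
        self.num_tokens = num_tokens
        self.proj = nn.Linear(3, dim)

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (B, N, 3); naive uniform subsampling stands in for real grouping
        idx = torch.linspace(0, points.shape[1] - 1, self.num_tokens).long()
        return self.proj(points[:, idx, :])  # (B, num_tokens, dim)


class PointToLLMProjector(nn.Module):
    """Projects 3D tokens into the LLM embedding dimension (assumed
    connector design; dimensions are placeholders)."""

    def __init__(self, in_dim: int = 384, llm_dim: int = 4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.mlp(tokens)  # (B, num_tokens, llm_dim)


if __name__ == "__main__":
    pc = torch.randn(2, 1024, 3)                # batch of 2 point clouds
    point_tokens = PointCloudEncoderStub()(pc)  # encode points to tokens
    llm_tokens = PointToLLMProjector()(point_tokens)
    print(llm_tokens.shape)                     # torch.Size([2, 64, 4096])

In practice the projected tokens would be concatenated with text embeddings before being fed to the LLM; that step is omitted here since it depends on the specific language model used.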

Cite

Text

Qi et al. "ShapeLLM: Universal 3D Object Understanding for Embodied Interaction." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-72775-7_13

Markdown

[Qi et al. "ShapeLLM: Universal 3D Object Understanding for Embodied Interaction." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/qi2024eccv-shapellm/) doi:10.1007/978-3-031-72775-7_13

BibTeX

@inproceedings{qi2024eccv-shapellm,
  title     = {{ShapeLLM: Universal 3D Object Understanding for Embodied Interaction}},
  author    = {Qi, Zekun and Dong, Runpei and Zhang, Shaochen and Geng, Haoran and Han, Chunrui and Ge, Zheng and Yi, Li and Ma, Kaisheng},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-72775-7_13},
  url       = {https://mlanthology.org/eccv/2024/qi2024eccv-shapellm/}
}