ShapeLLM: Universal 3D Object Understanding for Embodied Interaction
Abstract
This paper presents ShapeLLM, the first 3D Multimodal Large Language Model (LLM) designed for embodied interaction, exploring universal 3D object understanding with 3D point clouds and languages. ShapeLLM is built upon an improved 3D encoder obtained by extending ReCon to ReCon++, which benefits from multi-view image distillation for enhanced geometry understanding. By utilizing ReCon++ as the 3D point cloud input encoder for LLMs, ShapeLLM is trained on constructed instruction-following data and tested on our newly human-curated benchmark, 3D MM-Vet. ReCon++ and ShapeLLM achieve state-of-the-art performance in 3D geometry understanding and language-unified 3D interaction tasks, such as embodied visual grounding.
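To make the pipeline described in the abstract concrete, the sketch below shows one plausible way a point-cloud encoder's output could be projected into an LLM's token space and consumed alongside text tokens. It is a minimal illustration under assumed module names, dimensions, and pooling choices; it is not the authors' ReCon++ or ShapeLLM implementation.

```python
# Hypothetical sketch: point cloud -> 3D encoder -> projector -> LLM tokens.
# All names and dimensions are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn


class PointCloudEncoder(nn.Module):
    """Placeholder for a ReCon++-style 3D encoder (assumed interface)."""

    def __init__(self, out_dim: int = 512, num_tokens: int = 64):
        super().__init__()
        self.num_tokens = num_tokens
        self.mlp = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, out_dim))

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (B, N, 3) -> per-point features, pooled into a fixed
        # number of shape tokens (a crude stand-in for the real encoder).
        feats = self.mlp(points)                              # (B, N, D)
        B, N, D = feats.shape
        group = N // self.num_tokens
        feats = feats[:, : self.num_tokens * group]
        return feats.view(B, self.num_tokens, group, D).mean(dim=2)  # (B, T, D)


class ShapeTokenProjector(nn.Module):
    """Projects 3D shape tokens into the LLM embedding space."""

    def __init__(self, in_dim: int = 512, llm_dim: int = 1024):
        super().__init__()
        self.proj = nn.Linear(in_dim, llm_dim)

    def forward(self, shape_tokens: torch.Tensor) -> torch.Tensor:
        return self.proj(shape_tokens)                        # (B, T, llm_dim)


if __name__ == "__main__":
    points = torch.randn(2, 1024, 3)        # batch of point clouds
    text_embeds = torch.randn(2, 32, 1024)  # embedded instruction tokens
    encoder, projector = PointCloudEncoder(), ShapeTokenProjector()
    shape_embeds = projector(encoder(points))                 # (2, 64, 1024)
    # An LLM would attend over shape tokens prepended to the text tokens.
    llm_input = torch.cat([shape_embeds, text_embeds], dim=1)
    print(llm_input.shape)                                    # torch.Size([2, 96, 1024])
```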
Cite
Text
Qi et al. "ShapeLLM: Universal 3D Object Understanding for Embodied Interaction." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-72775-7_13
Markdown
[Qi et al. "ShapeLLM: Universal 3D Object Understanding for Embodied Interaction." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/qi2024eccv-shapellm/) doi:10.1007/978-3-031-72775-7_13
BibTeX
@inproceedings{qi2024eccv-shapellm,
title = {{ShapeLLM: Universal 3D Object Understanding for Embodied Interaction}},
author = {Qi, Zekun and Dong, Runpei and Zhang, Shaochen and Geng, Haoran and Han, Chunrui and Ge, Zheng and Yi, Li and Ma, Kaisheng},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2024},
doi = {10.1007/978-3-031-72775-7_13},
url = {https://mlanthology.org/eccv/2024/qi2024eccv-shapellm/}
}