Kestrel: 3D Multimodal LLM for Part-Aware Grounded Description

Abstract

In this paper, we introduce Part-Aware Point Grounded Description (PaPGD), a challenging task aimed at advancing 3D multimodal learning for fine-grained, part-aware segmentation grounding and detailed explanation of 3D objects. Existing 3D datasets largely focus on either vision-only part segmentation or vision-language scene segmentation, and lack the fine-grained multimodal segmentation needed for robotic navigation and interaction in real-world environments. To address this gap, we present the 3DCoMPaT Grounded Instructions (3DCoMPaT-GrIn) Dataset, a comprehensive resource that pairs rich point cloud descriptions with corresponding part-level segmentation masks. The dataset contains extensive samples designed for both PaPGD and fine-grained single-part grounding tasks. To tackle the inherent challenges of grounding objects and generating grounded descriptions at the part level, we propose Kestrel, a part-aware 3D multimodal large language model that integrates an advanced language model for nuanced language comprehension with multi-level point feature propagation and a query refinement mechanism to enhance spatial reasoning at the part level. Extensive experiments demonstrate that Kestrel effectively bridges the gap between part-aware language understanding and 3D segmentation grounding, paving the way for more robust and interpretable 3D object comprehension that meets the demands of real-world robotic applications.

Cite

Text

Ahmed et al. "Kestrel: 3D Multimodal LLM for Part-Aware Grounded Description." International Conference on Computer Vision, 2025.

Markdown

[Ahmed et al. "Kestrel: 3D Multimodal LLM for Part-Aware Grounded Description." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/ahmed2025iccv-kestrel/)

BibTeX

@inproceedings{ahmed2025iccv-kestrel,
  title     = {{Kestrel: 3D Multimodal LLM for Part-Aware Grounded Description}},
  author    = {Ahmed, Mahmoud and Fei, Junjie and Ding, Jian and Bakr, Eslam Mohamed and Elhoseiny, Mohamed},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {8973--8983},
  url       = {https://mlanthology.org/iccv/2025/ahmed2025iccv-kestrel/}
}