SceneCOT: Eliciting Grounded Chain-of-Thought Reasoning in 3D Scenes

Abstract

Existing research on 3D LLMs still struggles to achieve efficient and explainable reasoning, primarily because the mechanism of human-like, scene-object grounded reasoning remains under-explored. This paper bridges the gap by presenting a novel framework. We first introduce a Chain-of-Thought reasoning framework for 3D scenes (SceneCOT), which decouples a complex reasoning task into simpler, more manageable problems and builds corresponding visual clues based on multimodal expert modules. To enable such a framework, we build the first large-scale 3D scene Chain-of-Thought reasoning dataset, SceneCOT, including more than 190k high-quality data instances. Extensive experiments across various complex 3D scene reasoning benchmarks demonstrate that our new framework achieves state-of-the-art performance with clear interpretability. To our knowledge, this is the first successful attempt to implement the COT technique for human-like step-by-step reasoning in 3D scene understanding, and we show great potential for extending it to a wider range of 3D scene understanding scenarios.

Cite

Text

Linghu et al. "SceneCOT: Eliciting Grounded Chain-of-Thought Reasoning in 3D Scenes." International Conference on Learning Representations, 2026.

Markdown

[Linghu et al. "SceneCOT: Eliciting Grounded Chain-of-Thought Reasoning in 3D Scenes." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/linghu2026iclr-scenecot/)

BibTeX

@inproceedings{linghu2026iclr-scenecot,
  title     = {{SceneCOT: Eliciting Grounded Chain-of-Thought Reasoning in 3D Scenes}},
  author    = {Linghu, Xiongkun and Huang, Jiangyong and Zhu, Ziyu and Jia, Baoxiong and Huang, Siyuan},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/linghu2026iclr-scenecot/}
}