HIS-GPT: Towards 3D Human-in-Scene Multimodal Understanding

Abstract

We propose a new task to benchmark human-in-scene understanding for embodied agents: Human-In-Scene Question Answering (HIS-QA). Given a human motion within a 3D scene, HIS-QA requires the agent to comprehend the human's states and behaviors, reason about the surrounding environment, and answer human-related questions within the scene. To support this new task, we present HIS-Bench, a multimodal benchmark that systematically evaluates HIS understanding across a broad spectrum, from basic perception to commonsense reasoning and planning. Our evaluation of various vision-language models on HIS-Bench reveals significant limitations in their ability to handle HIS-QA tasks. To address these limitations, we propose HIS-GPT, the first foundation model for HIS understanding. HIS-GPT integrates 3D scene context and human motion dynamics into large language models while incorporating specialized mechanisms to capture human-scene interactions. Extensive experiments demonstrate that HIS-GPT sets a new state of the art on HIS-QA tasks. We hope this work inspires future research on human behavior analysis in 3D scenes, advancing embodied AI and world models.
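To make the fusion idea in the abstract concrete, here is a minimal, purely illustrative PyTorch sketch of feeding 3D scene and human-motion features into an LLM as a multimodal token prefix. The module names, feature dimensions, and the cross-attention interaction step are assumptions for exposition, not the paper's actual HIS-GPT architecture.

```python
import torch
import torch.nn as nn

class HISFusionSketch(nn.Module):
    """Hypothetical sketch: project 3D scene and human-motion features into
    the LLM embedding space and prepend them to the text tokens."""

    def __init__(self, scene_dim=256, motion_dim=128, llm_dim=4096):
        super().__init__()
        self.scene_proj = nn.Linear(scene_dim, llm_dim)    # scene tokens -> LLM space
        self.motion_proj = nn.Linear(motion_dim, llm_dim)  # motion tokens -> LLM space
        # Cross-attention lets motion tokens attend to scene context; this is a
        # stand-in for the paper's human-scene interaction mechanism, not its design.
        self.interact = nn.MultiheadAttention(llm_dim, num_heads=8, batch_first=True)

    def forward(self, scene_feats, motion_feats, text_embeds):
        # scene_feats: (B, N_s, scene_dim); motion_feats: (B, N_m, motion_dim)
        scene = self.scene_proj(scene_feats)
        motion = self.motion_proj(motion_feats)
        motion, _ = self.interact(motion, scene, scene)  # motion attends to scene
        # Concatenate multimodal tokens before the text embeddings fed to the LLM.
        return torch.cat([scene, motion, text_embeds], dim=1)
```

Under this sketch, the concatenated sequence would be passed to a frozen or fine-tuned LLM backbone so that answers to HIS-QA questions are conditioned jointly on the scene, the motion, and the question text.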

Cite

Text

Zhao et al. "HIS-GPT: Towards 3D Human-in-Scene Multimodal Understanding." International Conference on Computer Vision, 2025.

Markdown

[Zhao et al. "HIS-GPT: Towards 3D Human-in-Scene Multimodal Understanding." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/zhao2025iccv-hisgpt/)

BibTeX

@inproceedings{zhao2025iccv-hisgpt,
  title     = {{HIS-GPT: Towards 3D Human-in-Scene Multimodal Understanding}},
  author    = {Zhao, Jiahe and Hou, Ruibing and Tian, Zejie and Chang, Hong and Shan, Shiguang},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {4317--4327},
  url       = {https://mlanthology.org/iccv/2025/zhao2025iccv-hisgpt/}
}