HSI-GPT: A General-Purpose Large Scene-Motion-Language Model for Human Scene Interaction
Abstract
While text-to-motion generation has seen flourishing development, synthesizing physically realistic, controllable, language-conditioned Human Scene Interactions (HSI) remains relatively underexplored. Current HSI methods rely largely on conditional Variational Autoencoders (cVAEs) and diffusion models. They typically support only a limited set of control-signal modalities and depend on task-specific framework designs, which makes them inflexible across interaction scenarios and prone to producing motions unfaithful to their descriptions in diverse 3D physical environments. In this paper, we propose HSI-GPT, a General-Purpose Large Scene-Motion-Language Model that applies the "next-token prediction" paradigm of Large Language Models (LLMs) to the HSI domain. HSI-GPT not only exhibits remarkable flexibility in accommodating diverse control signals (3D scenes, textual commands, key-frame poses, and scene affordances), but also seamlessly supports various HSI-related tasks (e.g., multi-modal controlled HSI generation, HSI understanding, and general motion completion in 3D scenes). First, HSI-GPT quantizes textual descriptions and human motions into discrete, LLM-interpretable tokens with multi-modal tokenizers. Inspired by multi-modal learning, we develop a recipe for aligning mixed-modality tokens within the shared embedding space of LLMs. These interaction tokens are then organized into unified instruction-following prompts, allowing HSI-GPT to be fine-tuned on question-and-answer tasks. Extensive experiments and visualizations validate that our general-purpose HSI-GPT delivers exceptional performance across multiple HSI-related tasks.
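To make the tokenize-then-prompt pipeline described in the abstract concrete, below is a minimal illustrative sketch, not the authors' implementation. It assumes a pretrained VQ-style motion codebook, a hypothetical <motion_id_k> special-token format, and a generic instruction/response prompt template; all names and sizes are illustrative assumptions rather than details from the paper.

```python
# Minimal sketch (not the paper's code) of two steps the abstract describes:
# (1) quantizing continuous motion features into discrete tokens via a learned
#     codebook (VQ-style nearest-neighbour lookup), and
# (2) packing those tokens, together with a textual command, into an
#     instruction-following prompt for next-token prediction.
import numpy as np

RNG = np.random.default_rng(0)

# Assumed pretrained motion codebook: K entries of dimension D (illustrative sizes).
K, D = 512, 256
codebook = RNG.standard_normal((K, D)).astype(np.float32)


def tokenize_motion(motion_feats: np.ndarray) -> list[int]:
    """Map each per-frame motion feature vector (T, D) to the index of its
    nearest codebook entry, yielding T discrete motion tokens."""
    # Squared L2 distance between every frame and every codebook entry.
    dists = ((motion_feats[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1).tolist()


def build_prompt(instruction: str, motion_tokens: list[int]) -> str:
    """Assemble a unified instruction-following prompt; the <motion_id_k>
    special-token format and the template below are hypothetical."""
    motion_str = " ".join(f"<motion_id_{t}>" for t in motion_tokens)
    return (
        "### Instruction:\n"
        f"{instruction}\n"
        "### Response:\n"
        f"<motion_start> {motion_str} <motion_end>"
    )


if __name__ == "__main__":
    # Toy motion clip: 8 frames of D-dimensional features.
    clip = RNG.standard_normal((8, D)).astype(np.float32)
    toks = tokenize_motion(clip)
    print(build_prompt("Sit down on the chair near the window.", toks))
```

In such a setup, the prompt string would itself be tokenized by the LLM (with the motion IDs mapped to dedicated embedding entries aligned to the language space), and fine-tuning would proceed as ordinary next-token prediction over the mixed-modality sequence.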
Cite
Text
Wang et al. "HSI-GPT: A General-Purpose Large Scene-Motion-Language Model for Human Scene Interaction." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.00670
Markdown
[Wang et al. "HSI-GPT: A General-Purpose Large Scene-Motion-Language Model for Human Scene Interaction." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/wang2025cvpr-hsigpt/) doi:10.1109/CVPR52734.2025.00670
BibTeX
@inproceedings{wang2025cvpr-hsigpt,
title = {{HSI-GPT: A General-Purpose Large Scene-Motion-Language Model for Human Scene Interaction}},
author = {Wang, Yuan and Li, Yali and Li, Xiang and Wang, Shengjin},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2025},
  pages = {7147--7157},
doi = {10.1109/CVPR52734.2025.00670},
url = {https://mlanthology.org/cvpr/2025/wang2025cvpr-hsigpt/}
}