HSI-GPT: A General-Purpose Large Scene-Motion-Language Model for Human Scene Interaction
Abstract
While text-to-motion generation has seen flourishing development, synthesizing physically realistic, controllable, language-conditioned Human Scene Interactions (HSI) remains relatively underexplored. Current HSI methods rely largely on conditional Variational Autoencoders (cVAEs) and diffusion models. They typically support only a limited set of control-signal modalities and depend on task-specific framework designs, which makes them inflexible across interaction scenarios and prone to producing motions unfaithful to their descriptions in diverse 3D physical environments. In this paper, we propose HSI-GPT, a General-Purpose Large Scene-Motion-Language Model that applies the "next-token prediction" paradigm of Large Language Models (LLMs) to the HSI domain. HSI-GPT not only exhibits remarkable flexibility in accommodating diverse control signals (3D scenes, textual commands, key-frame poses, and scene affordances), but also seamlessly supports various HSI-related tasks (e.g., multi-modal controlled HSI generation, HSI understanding, and general motion completion in 3D scenes). First, HSI-GPT quantizes textual descriptions and human motions into discrete, LLM-interpretable tokens with multi-modal tokenizers. Inspired by multi-modal learning, we develop a recipe for aligning mixed-modality tokens within the shared embedding space of LLMs. These interaction tokens are then organized into unified instruction-following prompts, allowing HSI-GPT to be fine-tuned on question-and-answer tasks. Extensive experiments and visualizations validate that our general-purpose HSI-GPT delivers exceptional performance across multiple HSI-related tasks.
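To make the tokenize-then-prompt pipeline described in the abstract concrete, below is a minimal illustrative sketch, not the authors' implementation. It assumes a pretrained VQ-style motion codebook, a hypothetical <motion_id_k> special-token format, and a generic instruction/response prompt template; all names and sizes are illustrative assumptions rather than details from the paper.

```python
# Minimal sketch (not the paper's code) of two steps the abstract describes:
# (1) quantizing continuous motion features into discrete tokens via a learned
#     codebook (VQ-style nearest-neighbour lookup), and
# (2) packing those tokens, together with a textual command, into an
#     instruction-following prompt for next-token prediction.
import numpy as np

RNG = np.random.default_rng(0)

# Assumed pretrained motion codebook: K entries of dimension D (illustrative sizes).
K, D = 512, 256
codebook = RNG.standard_normal((K, D)).astype(np.float32)


def tokenize_motion(motion_feats: np.ndarray) -> list[int]:
    """Map each per-frame motion feature vector (T, D) to the index of its
    nearest codebook entry, yielding T discrete motion tokens."""
    # Squared L2 distance between every frame and every codebook entry.
    dists = ((motion_feats[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1).tolist()


def build_prompt(instruction: str, motion_tokens: list[int]) -> str:
    """Assemble a unified instruction-following prompt; the <motion_id_k>
    special-token format and the template below are hypothetical."""
    motion_str = " ".join(f"<motion_id_{t}>" for t in motion_tokens)
    return (
        "### Instruction:\n"
        f"{instruction}\n"
        "### Response:\n"
        f"<motion_start> {motion_str} <motion_end>"
    )


if __name__ == "__main__":
    # Toy motion clip: 8 frames of D-dimensional features.
    clip = RNG.standard_normal((8, D)).astype(np.float32)
    toks = tokenize_motion(clip)
    print(build_prompt("Sit down on the chair near the window.", toks))
```

In such a setup, the prompt string would itself be tokenized by the LLM (with the motion IDs mapped to dedicated embedding entries aligned to the language space), and fine-tuning would proceed as ordinary next-token prediction over the mixed-modality sequence.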
Cite
Text
Wang et al. "HSI-GPT: A General-Purpose Large Scene-Motion-Language Model for Human Scene Interaction." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.00670
Markdown
[Wang et al. "HSI-GPT: A General-Purpose Large Scene-Motion-Language Model for Human Scene Interaction." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/wang2025cvpr-hsigpt/) doi:10.1109/CVPR52734.2025.00670
BibTeX
@inproceedings{wang2025cvpr-hsigpt,
title = {{HSI-GPT: A General-Purpose Large Scene-Motion-Language Model for Human Scene Interaction}},
author = {Wang, Yuan and Li, Yali and Li, Xiang and Wang, Shengjin},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2025},
  pages = {7147--7157},
doi = {10.1109/CVPR52734.2025.00670},
url = {https://mlanthology.org/cvpr/2025/wang2025cvpr-hsigpt/}
}