MotionCtrl: A Real-Time Controllable Vision-Language-Motion Model
Abstract
Human motion generation involves synthesizing coherent human motion sequences conditioned on diverse multimodal inputs and holds significant potential for real-world applications. Despite recent advancements, existing vision-language-motion models (VLMMs) remain limited in achieving this goal. In this paper, we identify the lack of controllability as a critical bottleneck, where VLMMs struggle with diverse human commands, pose initialization, generation of long-term or unseen cases, and fine-grained control over individual body parts. To address these challenges, we introduce MotionCtrl, the first real-time, controllable VLMM with state-of-the-art performance. MotionCtrl achieves its controllability through training on HuMo100M, the largest human motion dataset to date, featuring over 5 million self-collected motions, 100 million multi-task instructional instances, and detailed part-level descriptions that address a long-standing gap in the field. Additionally, we propose a novel part-aware residual quantization technique for motion tokenization, enabling precise control over individual body parts during motion generation. Extensive experiments demonstrate MotionCtrl's superior performance across a wide range of motion benchmarks. Furthermore, we provide strategic design insights and a detailed time efficiency analysis to guide the development of practical motion generators.
Cite
Text
Cao et al. "MotionCtrl: A Real-Time Controllable Vision-Language-Motion Model." International Conference on Computer Vision, 2025.
Markdown
[Cao et al. "MotionCtrl: A Real-Time Controllable Vision-Language-Motion Model." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/cao2025iccv-motionctrl/)
BibTeX
@inproceedings{cao2025iccv-motionctrl,
title = {{MotionCtrl: A Real-Time Controllable Vision-Language-Motion Model}},
author = {Cao, Bin and Zheng, Sipeng and Wang, Ye and Xia, Lujie and Wei, Qianshan and Jin, Qin and Liu, Jing and Lu, Zongqing},
booktitle = {International Conference on Computer Vision},
year = {2025},
pages = {12253--12262},
url = {https://mlanthology.org/iccv/2025/cao2025iccv-motionctrl/}
}