MotionCtrl: A Real-Time Controllable Vision-Language-Motion Model

Abstract

Human motion generation involves synthesizing coherent human motion sequences conditioned on diverse multimodal inputs and holds significant potential for real-world applications. Despite recent advancements, existing vision-language-motion models (VLMMs) remain limited in achieving this goal. In this paper, we identify the lack of controllability as a critical bottleneck, where VLMMs struggle with diverse human commands, pose initialization, generation of long-term or unseen cases, and fine-grained control over individual body parts. To address these challenges, we introduce MotionCtrl, the first real-time, controllable VLMM with state-of-the-art performance. MotionCtrl achieves its controllability through training on HuMo100M, the largest human motion dataset to date, featuring over 5 million self-collected motions, 100 million multi-task instructional instances, and detailed part-level descriptions that address a long-standing gap in the field. Additionally, we propose a novel part-aware residual quantization technique for motion tokenization, enabling precise control over individual body parts during motion generation. Extensive experiments demonstrate MotionCtrl's superior performance across a wide range of motion benchmarks. Furthermore, we provide strategic design insights and a detailed time efficiency analysis to guide the development of practical motion generators.

Cite

Text

Cao et al. "MotionCtrl: A Real-Time Controllable Vision-Language-Motion Model." International Conference on Computer Vision, 2025.

Markdown

[Cao et al. "MotionCtrl: A Real-Time Controllable Vision-Language-Motion Model." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/cao2025iccv-motionctrl/)

BibTeX

@inproceedings{cao2025iccv-motionctrl,
  title     = {{MotionCtrl: A Real-Time Controllable Vision-Language-Motion Model}},
  author    = {Cao, Bin and Zheng, Sipeng and Wang, Ye and Xia, Lujie and Wei, Qianshan and Jin, Qin and Liu, Jing and Lu, Zongqing},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {12253--12262},
  url       = {https://mlanthology.org/iccv/2025/cao2025iccv-motionctrl/}
}