GenM3: Generative Pretrained Multi-Path Motion Model for Text Conditional Human Motion Generation
Abstract
Scaling up motion datasets is crucial to enhance motion generation capabilities. However, training on large-scale multi-source datasets introduces data heterogeneity challenges due to variations in motion content. To address this, we propose the Generative Pretrained Multi-path Motion Model (GenM^3), a comprehensive framework designed to learn unified motion representations. GenM^3 comprises two components: 1) a Multi-Expert VQ-VAE (MEVQ-VAE) that adapts to different dataset distributions to learn a unified discrete motion representation, and 2) a Multi-path Motion Transformer (MMT) that improves intra-modal representations by using separate modality-specific pathways, each with densely activated experts to accommodate variations within that modality, and improves inter-modal alignment through a text-motion shared pathway. To enable large-scale training, we integrate and unify 11 high-quality motion datasets (approximately 220 hours of motion data) and augment them with textual annotations (nearly 10,000 motion sequences labeled by a large language model and 300+ by human experts). After training on our integrated dataset, GenM^3 achieves a state-of-the-art FID of 0.035 on the HumanML3D benchmark, surpassing prior methods by a large margin. It also demonstrates strong zero-shot generalization on the IDEA400 dataset, highlighting its effectiveness and adaptability across diverse motion scenarios.
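The abstract does not specify how the MEVQ-VAE's experts are realized, but the stated idea is dataset-conditioned quantization into a shared discrete representation. Below is a minimal, hypothetical sketch of that idea: each dataset group routes its encoder latents through its own expert codebook via standard nearest-neighbor vector quantization. All names (`quantize`, `multi_expert_quantize`) and shapes are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def quantize(z, codebook):
    """Standard VQ-VAE lookup: snap each latent vector to its nearest code.

    z: (T, D) latent motion sequence; codebook: (K, D) discrete codes.
    Returns the quantized sequence and the chosen code indices (tokens).
    """
    # Squared Euclidean distance between every latent frame and every code.
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (T, K)
    idx = d.argmin(axis=1)                                          # (T,)
    return codebook[idx], idx

def multi_expert_quantize(z, codebooks, expert_id):
    """Route a sequence to the expert codebook of its source dataset."""
    return quantize(z, codebooks[expert_id])

rng = np.random.default_rng(0)
# Hypothetical setup: 3 dataset-distribution experts, 512 codes of dim 8 each.
codebooks = [rng.normal(size=(512, 8)) for _ in range(3)]
z = rng.normal(size=(16, 8))  # a 16-frame encoded motion clip
z_q, tokens = multi_expert_quantize(z, codebooks, expert_id=1)
```

The resulting token sequence is what a transformer like MMT would then model autoregressively; a unified representation would additionally require the experts' outputs to share one token space, a detail the abstract leaves open.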
Cite
Text
Shi et al. "GenM3: Generative Pretrained Multi-Path Motion Model for Text Conditional Human Motion Generation." International Conference on Computer Vision, 2025.
Markdown
[Shi et al. "GenM3: Generative Pretrained Multi-Path Motion Model for Text Conditional Human Motion Generation." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/shi2025iccv-genm3/)
BibTeX
@inproceedings{shi2025iccv-genm3,
title = {{GenM3: Generative Pretrained Multi-Path Motion Model for Text Conditional Human Motion Generation}},
author = {Shi, Junyu and Liu, Lijiang and Sun, Yong and Zhang, Zhiyuan and Zhou, Jinni and Nie, Qiang},
booktitle = {International Conference on Computer Vision},
year = {2025},
pages = {13129-13139},
url = {https://mlanthology.org/iccv/2025/shi2025iccv-genm3/}
}