Generating Human Motion from Textual Descriptions with Discrete Representations

Abstract

In this work, we investigate a simple and widely studied conditional generative framework based on the Vector Quantised-Variational AutoEncoder (VQ-VAE) and the Generative Pre-trained Transformer (GPT) for generating human motion from textual descriptions. We show that a simple CNN-based VQ-VAE with commonly used training recipes (EMA and Code Reset) yields high-quality discrete representations. For GPT, we incorporate a simple corruption strategy during training to alleviate the training-testing discrepancy. Despite its simplicity, our T2M-GPT outperforms competitive approaches, including recent diffusion-based methods. For example, on HumanML3D, currently the largest dataset, we achieve comparable performance on the consistency between text and generated motion (R-Precision), while reaching an FID of 0.116, largely outperforming MotionDiffuse at 0.630. Additionally, our analyses on HumanML3D suggest that dataset size is a limitation of our approach. Our work shows that VQ-VAE remains a competitive approach for human motion generation. Our implementation is available on the project page: https://mael-zys.github.io/T2M-GPT/
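The two codebook training recipes the abstract credits for high-quality discrete representations (EMA updates and Code Reset) can be illustrated with a minimal sketch. This is not the authors' implementation; class and parameter names (`EMACodebook`, `decay`, `reset_threshold`) and all hyperparameters here are illustrative assumptions.

```python
import numpy as np

class EMACodebook:
    """Minimal VQ codebook sketch (assumed, not the paper's code):
    nearest-neighbour quantization, EMA codebook updates, and code reset."""

    def __init__(self, num_codes=8, dim=4, decay=0.99, seed=0):
        rng = np.random.default_rng(seed)
        self.codes = rng.normal(size=(num_codes, dim))
        self.decay = decay
        self.ema_count = np.ones(num_codes)   # per-code usage, smoothed by EMA
        self.ema_sum = self.codes.copy()      # per-code feature sum, smoothed by EMA

    def quantize(self, x):
        # Map each encoder output (row of x) to its nearest codebook entry.
        dists = ((x[:, None, :] - self.codes[None, :, :]) ** 2).sum(-1)
        idx = dists.argmin(1)
        return idx, self.codes[idx]

    def update(self, x, idx, reset_threshold=1e-2):
        # EMA update of usage counts and feature sums, then recompute codes
        # as the (smoothed) mean of the features assigned to each code.
        one_hot = np.eye(len(self.codes))[idx]                     # (batch, num_codes)
        self.ema_count = self.decay * self.ema_count + (1 - self.decay) * one_hot.sum(0)
        self.ema_sum = self.decay * self.ema_sum + (1 - self.decay) * (one_hot.T @ x)
        self.codes = self.ema_sum / self.ema_count[:, None]
        # Code Reset: reinitialise rarely used ("dead") codes from random
        # encoder outputs so they can be reused.
        dead = self.ema_count < reset_threshold
        if dead.any():
            picks = np.random.default_rng(0).choice(len(x), int(dead.sum()))
            self.codes[dead] = x[picks]
```

A training step would quantize a batch of encoder outputs, pass the quantized vectors to the decoder (with a straight-through gradient), and then call `update` on the codebook.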

Cite

Text

Zhang et al. "Generating Human Motion from Textual Descriptions with Discrete Representations." Conference on Computer Vision and Pattern Recognition, 2023. doi:10.1109/CVPR52729.2023.01415

Markdown

[Zhang et al. "Generating Human Motion from Textual Descriptions with Discrete Representations." Conference on Computer Vision and Pattern Recognition, 2023.](https://mlanthology.org/cvpr/2023/zhang2023cvpr-generating/) doi:10.1109/CVPR52729.2023.01415

BibTeX

@inproceedings{zhang2023cvpr-generating,
  title     = {{Generating Human Motion from Textual Descriptions with Discrete Representations}},
  author    = {Zhang, Jianrong and Zhang, Yangsong and Cun, Xiaodong and Zhang, Yong and Zhao, Hongwei and Lu, Hongtao and Shen, Xi and Shan, Ying},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2023},
  pages     = {14730--14740},
  doi       = {10.1109/CVPR52729.2023.01415},
  url       = {https://mlanthology.org/cvpr/2023/zhang2023cvpr-generating/}
}