AttT2M: Text-Driven Human Motion Generation with Multi-Perspective Attention Mechanism

Abstract

Generating 3D human motion from textual descriptions has been a research focus in recent years. The generated motion must be diverse, natural, and consistent with the textual description. Due to the complex spatio-temporal nature of human motion and the difficulty of learning the cross-modal relationship between text and motion, text-driven motion generation remains a challenging problem. To address these issues, we propose AttT2M, a two-stage method with a multi-perspective attention mechanism: body-part attention and global-local motion-text attention. The former addresses the motion-embedding perspective by introducing a body-part spatio-temporal encoder into the VQ-VAE to learn a more expressive discrete latent space. The latter addresses the cross-modal perspective and learns the sentence-level and word-level motion-text relationships. The text-driven motion is finally generated with a generative transformer. Extensive experiments conducted on HumanML3D and KIT-ML demonstrate that our method outperforms current state-of-the-art works in both qualitative and quantitative evaluation, and achieves fine-grained synthesis and action2motion generation. Our code will be publicly available.
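To make the two-stage design described in the abstract concrete, below is a minimal PyTorch sketch of such a pipeline. It is not the authors' released implementation: the module names (BodyPartEncoder, MotionVQVAE, TextConditionedTransformer), the dimensions, and the assumption of word-level text features from a frozen encoder such as CLIP are illustrative placeholders, and the training losses (reconstruction, commitment, cross-entropy) as well as the global-local attention details are omitted.

import torch
import torch.nn as nn


class BodyPartEncoder(nn.Module):
    """Illustrative stand-in for a body-part spatio-temporal encoder:
    per-part joint features are projected separately, fused with attention
    across body parts, and temporally downsampled by a strided convolution."""
    def __init__(self, part_dims, d_model=256):
        super().__init__()
        self.part_proj = nn.ModuleList([nn.Linear(d, d_model) for d in part_dims])
        self.part_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.temporal = nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1)

    def forward(self, parts):  # parts: list of (B, T, part_dim) tensors
        tokens = torch.stack([proj(x) for proj, x in zip(self.part_proj, parts)], dim=2)
        B, T, P, D = tokens.shape
        tokens = tokens.reshape(B * T, P, D)
        fused, _ = self.part_attn(tokens, tokens, tokens)      # attention over body parts
        fused = fused.mean(dim=1).reshape(B, T, D)             # pool parts per frame
        return self.temporal(fused.transpose(1, 2)).transpose(1, 2)  # (B, T', D)


class MotionVQVAE(nn.Module):
    """Stage 1 (sketch): encode motion, quantize to a discrete codebook, decode.
    out_dim=263 matches the HumanML3D pose-feature dimension; temporal
    upsampling in the decoder is omitted for brevity."""
    def __init__(self, part_dims, codebook_size=512, d_model=256, out_dim=263):
        super().__init__()
        self.encoder = BodyPartEncoder(part_dims, d_model)
        self.codebook = nn.Embedding(codebook_size, d_model)
        self.decoder = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                     nn.Linear(d_model, out_dim))

    def quantize(self, z):  # z: (B, T', D); straight-through estimator omitted
        dist = (z.pow(2).sum(-1, keepdim=True)
                - 2 * z @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(-1))
        idx = dist.argmin(dim=-1)                              # (B, T') token indices
        return self.codebook(idx), idx

    def forward(self, parts):
        z_q, idx = self.quantize(self.encoder(parts))
        return self.decoder(z_q), idx


class TextConditionedTransformer(nn.Module):
    """Stage 2 (sketch): autoregressively predict motion-token indices,
    cross-attending to word-level text features assumed to be pre-projected
    to d_model (e.g. from a frozen CLIP text encoder)."""
    def __init__(self, codebook_size=512, d_model=256, n_layers=4):
        super().__init__()
        self.token_emb = nn.Embedding(codebook_size + 1, d_model)  # +1: start token
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, codebook_size)

    def forward(self, token_ids, text_feats):  # (B, L), (B, n_words, d_model)
        x = self.token_emb(token_ids)
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1)).to(x.device)
        h = self.decoder(x, memory=text_feats, tgt_mask=mask)
        return self.head(h)                                    # logits over the codebook

In a sketch like this, stage 1 would be trained first to obtain the discrete motion codebook; stage 2 would then be trained to predict codebook indices from text, and at inference the sampled indices are decoded back to motion by the VQ-VAE decoder.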

Cite

Text

Zhong et al. "AttT2M: Text-Driven Human Motion Generation with Multi-Perspective Attention Mechanism." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.00053

Markdown

[Zhong et al. "AttT2M: Text-Driven Human Motion Generation with Multi-Perspective Attention Mechanism." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/zhong2023iccv-attt2m/) doi:10.1109/ICCV51070.2023.00053

BibTeX

@inproceedings{zhong2023iccv-attt2m,
  title     = {{AttT2M: Text-Driven Human Motion Generation with Multi-Perspective Attention Mechanism}},
  author    = {Zhong, Chongyang and Hu, Lei and Zhang, Zihao and Xia, Shihong},
  booktitle = {International Conference on Computer Vision},
  year      = {2023},
  pages     = {509--519},
  doi       = {10.1109/ICCV51070.2023.00053},
  url       = {https://mlanthology.org/iccv/2023/zhong2023iccv-attt2m/}
}