T2LM: Long-Term 3D Human Motion Generation from Multiple Sentences

Abstract

In this paper, we address the challenging problem of long-term 3D human motion generation. Specifically, we aim to generate a long sequence of smoothly connected actions from a stream of multiple sentences (i.e., a paragraph). Previous long-term motion generation approaches were mostly recurrent, using previously generated motion chunks as input for the next step. This approach has two drawbacks: 1) it relies on sequential datasets, which are expensive to collect; and 2) it yields unrealistic gaps between the motions generated at each step. To address these issues, we introduce T2LM, a simple yet effective continuous long-term generation framework that can be trained without sequential data. T2LM comprises two components: a 1D-convolutional VQ-VAE, trained to compress motion into sequences of latent vectors, and a Transformer-based Text Encoder that predicts a latent sequence given an input text. At inference, a sequence of sentences is translated into a continuous stream of latent vectors, which is then decoded into a motion by the VQ-VAE decoder; the use of 1D convolutions with a local temporal receptive field avoids temporal inconsistencies between training and generated sequences. This simple constraint on the VQ-VAE allows it to be trained with short sequences only and produces smoother transitions. T2LM outperforms prior long-term generation models while overcoming the constraint of requiring sequential data, and it is also competitive with SOTA single-action generation models.
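To make the two-stage design concrete, below is a minimal PyTorch sketch of the inference pipeline the abstract describes. The module names, dimensions, upsampling ratio, and the stand-in text_encoder callable are illustrative assumptions, not the authors' implementation; the point is only to show how a decoder with a local temporal receptive field lets per-sentence latent streams be concatenated and decoded in a single pass.

import torch
import torch.nn as nn

class ConvDecoder(nn.Module):
    """Sketch of a 1D-convolutional VQ-VAE decoder. Each output frame
    depends only on a local temporal window of latents, so latent
    streams produced for different sentences can be concatenated and
    decoded together without a train/test temporal inconsistency."""
    def __init__(self, latent_dim=256, pose_dim=75, upsample=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Upsample(scale_factor=upsample, mode="nearest"),
            nn.Conv1d(latent_dim, 512, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(512, pose_dim, kernel_size=3, padding=1),
        )

    def forward(self, z):      # z: (B, latent_dim, T_latent)
        return self.net(z)     # -> (B, pose_dim, T_latent * upsample)

@torch.no_grad()
def generate_long_motion(sentences, text_encoder, decoder):
    """Map each sentence to a latent sequence (text_encoder is a
    stand-in for the paper's Transformer-based Text Encoder), then
    concatenate the per-sentence latents into one continuous stream
    and decode it once into a single long motion."""
    latents = [text_encoder(s) for s in sentences]   # each (1, C, T_i)
    stream = torch.cat(latents, dim=2)               # (1, C, sum_i T_i)
    return decoder(stream)                           # one smooth motion

Because decoding happens in one pass over the concatenated stream, transitions between actions are produced by the same convolutions that generate the actions themselves, rather than by stitching independently generated chunks.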

Cite

Text

Lee et al. "T2LM: Long-Term 3D Human Motion Generation from Multiple Sentences." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024. doi:10.1109/CVPRW63382.2024.00193

Markdown

[Lee et al. "T2LM: Long-Term 3D Human Motion Generation from Multiple Sentences." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024.](https://mlanthology.org/cvprw/2024/lee2024cvprw-t2lm/) doi:10.1109/CVPRW63382.2024.00193

BibTeX

@inproceedings{lee2024cvprw-t2lm,
  title     = {{T2LM: Long-Term 3D Human Motion Generation from Multiple Sentences}},
  author    = {Lee, Taeryung and Baradel, Fabien and Lucas, Thomas and Lee, Kyoung Mu and Rogez, Grégory},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2024},
  pages     = {1867--1876},
  doi       = {10.1109/CVPRW63382.2024.00193},
  url       = {https://mlanthology.org/cvprw/2024/lee2024cvprw-t2lm/}
}