TEMOS: Generating Diverse Human Motions from Textual Descriptions
Abstract
We address the problem of generating diverse 3D human motions from textual descriptions. This challenging task requires joint modeling of both modalities: understanding and extracting useful human-centric information from the text, and then generating plausible and realistic sequences of human poses. In contrast to most previous work, which focuses on generating a single, deterministic motion from a textual description, we design a variational approach that can produce multiple diverse human motions. We propose TEMOS, a text-conditioned generative model leveraging variational autoencoder (VAE) training with human motion data, in combination with a text encoder that produces distribution parameters compatible with the VAE latent space. We show that the TEMOS framework can produce both skeleton-based animations as in prior work, as well as more expressive SMPL body motions. We evaluate our approach on the KIT Motion-Language benchmark and, despite being relatively straightforward, demonstrate significant improvements over the state of the art. Code and models are available on our webpage.
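The core idea described in the abstract — a text encoder that outputs Gaussian distribution parameters in a latent space shared with a motion VAE, so that sampling different latents yields diverse motions for one description — can be illustrated with a toy sketch. All function names, dimensions, and the linear "decoder" below are hypothetical stand-ins for illustration, not the actual TEMOS architecture (which uses Transformer encoders/decoders):

```python
import numpy as np

rng = np.random.default_rng(0)

def text_encoder(token_embeddings):
    # Hypothetical stand-in: pool token embeddings and map them to the
    # parameters (mean, log-variance) of a Gaussian in the latent space.
    pooled = token_embeddings.mean(axis=0)
    mu = pooled                # distribution mean, compatible with the VAE latent space
    logvar = -np.abs(pooled)   # toy log-variance head
    return mu, logvar

def sample_latent(mu, logvar, rng):
    # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I).
    # Each draw of eps gives a different latent, hence a different motion.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def motion_decoder(z, num_frames=30, pose_dim=63):
    # Toy decoder: a fixed linear map from the latent vector to a pose
    # sequence of shape (num_frames, pose_dim). The real model decodes
    # a full sequence of body poses (skeleton joints or SMPL parameters).
    W = rng.standard_normal((num_frames * pose_dim, z.shape[0])) / z.shape[0]
    return (W @ z).reshape(num_frames, pose_dim)

# Fake token embeddings standing in for an encoded sentence such as
# "a person walks forward"; three latent samples give three distinct motions.
tokens = rng.standard_normal((5, 16))
mu, logvar = text_encoder(tokens)
motions = [motion_decoder(sample_latent(mu, logvar, rng)) for _ in range(3)]
```

The key design point is that diversity comes for free: at test time the text encoder is queried once, and drawing multiple latent samples from the predicted distribution produces multiple plausible motions for the same description.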
Cite
Text
Petrovich et al. "TEMOS: Generating Diverse Human Motions from Textual Descriptions." Proceedings of the European Conference on Computer Vision (ECCV), 2022. doi:10.1007/978-3-031-20047-2_28
Markdown
[Petrovich et al. "TEMOS: Generating Diverse Human Motions from Textual Descriptions." Proceedings of the European Conference on Computer Vision (ECCV), 2022.](https://mlanthology.org/eccv/2022/petrovich2022eccv-temos/) doi:10.1007/978-3-031-20047-2_28
BibTeX
@inproceedings{petrovich2022eccv-temos,
title = {{TEMOS: Generating Diverse Human Motions from Textual Descriptions}},
author = {Petrovich, Mathis and Black, Michael J. and Varol, Gül},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2022},
doi = {10.1007/978-3-031-20047-2_28},
url = {https://mlanthology.org/eccv/2022/petrovich2022eccv-temos/}
}