Audio-Driven Stylized Gesture Generation with Flow-Based Model

Abstract

Generating stylized audio-driven gestures for robots and virtual avatars has attracted increasing attention in recent years. Existing methods either require style labels (e.g., speaker identities) or complex preprocessing of the data to obtain style control parameters. In this paper, we propose a new end-to-end flow-based model, which can generate audio-driven gestures of arbitrary styles without such preprocessing or style labels. To achieve this goal, we introduce a global encoder and a gesture perceptual loss into the classic generative flow model to capture both global and local information. We conduct extensive experiments on two benchmark datasets: the TED Dataset and the Trinity Dataset. Both quantitative and qualitative evaluations show that the proposed model outperforms state-of-the-art models.
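The abstract describes conditioning a generative flow on audio while a global encoder supplies sequence-level (style) information. The snippet below is a minimal, hypothetical sketch of one conditional affine-coupling step in that spirit; it is not the authors' implementation, and all module names, dimensions, and the stand-in conditioning tensors are illustrative assumptions.

```python
# Hypothetical sketch: an invertible affine-coupling layer whose scale/shift are
# predicted from per-frame audio features plus a global style embedding.
import torch
import torch.nn as nn

class ConditionalAffineCoupling(nn.Module):
    """One coupling step: half of the pose vector is transformed with a
    scale/shift predicted from the other half and the conditioning vector."""
    def __init__(self, pose_dim: int, cond_dim: int, hidden: int = 256):
        super().__init__()
        self.half = pose_dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half + cond_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2 * (pose_dim - self.half)),
        )

    def forward(self, x: torch.Tensor, cond: torch.Tensor):
        xa, xb = x[:, : self.half], x[:, self.half:]
        log_s, t = self.net(torch.cat([xa, cond], dim=-1)).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)           # keep scales bounded for stability
        yb = xb * torch.exp(log_s) + t      # affine transform of the second half
        log_det = log_s.sum(dim=-1)         # contribution to the flow log-likelihood
        return torch.cat([xa, yb], dim=-1), log_det

    def inverse(self, y: torch.Tensor, cond: torch.Tensor):
        ya, yb = y[:, : self.half], y[:, self.half:]
        log_s, t = self.net(torch.cat([ya, cond], dim=-1)).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)
        xb = (yb - t) * torch.exp(-log_s)
        return torch.cat([ya, xb], dim=-1)

# Conditioning = per-frame audio features + a global style code from a
# (hypothetical) sequence-level encoder; stand-in random tensors here.
batch, pose_dim, audio_dim, style_dim = 4, 48, 32, 16
coupling = ConditionalAffineCoupling(pose_dim, audio_dim + style_dim)
pose = torch.randn(batch, pose_dim)
cond = torch.cat([torch.randn(batch, audio_dim),
                  torch.randn(batch, style_dim)], dim=-1)

z, log_det = coupling(pose, cond)       # pose -> latent (training direction)
pose_rec = coupling.inverse(z, cond)    # latent -> pose (generation direction)
print(torch.allclose(pose, pose_rec, atol=1e-5))  # invertibility check
```

In an actual flow model, several such steps would be stacked and trained by maximizing the exact log-likelihood (base-distribution log-density plus the accumulated log-determinants); the paper additionally adds a gesture perceptual loss, which is not reproduced here.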

Cite

Text

Ye et al. "Audio-Driven Stylized Gesture Generation with Flow-Based Model." Proceedings of the European Conference on Computer Vision (ECCV), 2022. doi:10.1007/978-3-031-20065-6_41

Markdown

[Ye et al. "Audio-Driven Stylized Gesture Generation with Flow-Based Model." Proceedings of the European Conference on Computer Vision (ECCV), 2022.](https://mlanthology.org/eccv/2022/ye2022eccv-audiodriven/) doi:10.1007/978-3-031-20065-6_41

BibTeX

@inproceedings{ye2022eccv-audiodriven,
  title     = {{Audio-Driven Stylized Gesture Generation with Flow-Based Model}},
  author    = {Ye, Sheng and Wen, Yu-Hui and Sun, Yanan and He, Ying and Zhang, Ziyang and Wang, Yaoyuan and He, Weihua and Liu, Yong-Jin},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2022},
  doi       = {10.1007/978-3-031-20065-6_41},
  url       = {https://mlanthology.org/eccv/2022/ye2022eccv-audiodriven/}
}