Co-Speech Gesture Video Generation with 3D Human Meshes
Abstract
Co-speech gesture video generation is an enabling technique for many digital human applications. While substantial progress has been made in creating high-quality talking head videos, existing hand gesture video generation methods remain limited by the widely adopted 2D skeleton-based gesture representation and still struggle to generate realistic hands. We introduce an audio-driven co-speech video generation pipeline that synthesizes human speech videos from a 3D human mesh-based gesture representation. Building on this representation, we present a mesh-grounded video generator that combines a mesh texture map optimization step with a conditional GAN network and outputs photorealistic gesture videos with realistic hands. Our experiments on the TalkSHOW dataset demonstrate the effectiveness of our method over 2D skeleton-based baselines.
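To make the pipeline described above concrete, here is a minimal PyTorch sketch of a mesh-conditioned frame generator: a rendered, textured 3D mesh frame is mapped through a small encoder-decoder to a photorealistic RGB frame. This is an illustration only, not the authors' released model; the class name, architecture, and tensor shapes are all assumptions for demonstration.

```python
# Hypothetical sketch of a mesh-conditioned frame generator.
# Illustration only: names, shapes, and architecture are assumptions,
# not the actual model from the paper.
import torch
import torch.nn as nn

class MeshConditionedGenerator(nn.Module):
    """Maps a rendered 3D-mesh frame (RGB) to a photorealistic video frame."""
    def __init__(self, in_channels=3, base=64):
        super().__init__()
        # Encoder: downsample the textured mesh rendering into features.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, base, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(base, base * 2, 4, stride=2, padding=1),
            nn.InstanceNorm2d(base * 2),
            nn.LeakyReLU(0.2, inplace=True),
        )
        # Decoder: upsample back to image resolution.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(base * 2, base, 4, stride=2, padding=1),
            nn.InstanceNorm2d(base),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(base, 3, 4, stride=2, padding=1),
            nn.Tanh(),  # output frame normalized to [-1, 1]
        )

    def forward(self, mesh_render):
        return self.decoder(self.encoder(mesh_render))

# Usage: one 256x256 rendered mesh frame -> one synthesized RGB frame.
g = MeshConditionedGenerator()
frame = g(torch.randn(1, 3, 256, 256))
print(frame.shape)  # torch.Size([1, 3, 256, 256])
```

In an actual conditional GAN setup, a discriminator would receive the (mesh render, output frame) pair, and the texture map driving the render would be optimized per subject before training the generator; both components are omitted here for brevity.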
Cite
Text
Mahapatra et al. "Co-Speech Gesture Video Generation with 3D Human Meshes." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-73024-5_11

Markdown
[Mahapatra et al. "Co-Speech Gesture Video Generation with 3D Human Meshes." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/mahapatra2024eccv-cospeech/) doi:10.1007/978-3-031-73024-5_11

BibTeX
@inproceedings{mahapatra2024eccv-cospeech,
title = {{Co-Speech Gesture Video Generation with 3D Human Meshes}},
author = {Mahapatra, Aniruddha and Mishra, Richa and Chen, Ziyi and Ding, Boyang and Li, Renda and Wang, Shoulei and Zhu, Jun-Yan and Chang, Peng and Han, Mei and Xiao, Jing},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2024},
doi = {10.1007/978-3-031-73024-5_11},
url = {https://mlanthology.org/eccv/2024/mahapatra2024eccv-cospeech/}
}