Neural Voice Puppetry: Audio-Driven Facial Reenactment
Abstract
We present Neural Voice Puppetry, a novel approach for audio-driven facial video synthesis. Given an audio sequence of a source person or digital assistant, we generate a photo-realistic output video of a target person that is in sync with the audio of the source input. This audio-driven facial reenactment is driven by a deep neural network that employs a latent 3D face model space. Through the underlying 3D representation, the model inherently learns temporal stability while we leverage neural rendering to generate photo-realistic output frames. Our approach generalizes across different people, allowing us to synthesize videos of a target actor with the voice of any unknown source actor or even synthetic voices that can be generated utilizing standard text-to-speech approaches. Neural Voice Puppetry has a variety of use-cases, including audio-driven video avatars, video dubbing, and text-driven video synthesis of a talking head. We demonstrate the capabilities of our method in a series of audio- and text-based puppetry examples, including comparisons to state-of-the-art techniques and a user study.
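For intuition, below is a minimal sketch of the pipeline the abstract describes: per-frame audio features are mapped to latent 3D expression coefficients, and a neural rendering stage converts the expression-driven 3D face model into photo-realistic frames. This is not the authors' implementation; the module names, feature dimensions, and layer choices are assumptions for illustration only (PyTorch is assumed).

# Illustrative sketch only (not the authors' code). All names, dimensions,
# and layer choices are hypothetical.
import torch
import torch.nn as nn

class Audio2Expression(nn.Module):
    """Predicts per-frame expression coefficients of a latent 3D face model
    from a short temporal window of audio features."""
    def __init__(self, feat_dim=29, window=16, n_expr=64):
        super().__init__()
        # Temporal convolution over the audio window encourages temporally
        # stable predictions compared to independent per-frame regression.
        self.net = nn.Sequential(
            nn.Conv1d(feat_dim, 32, kernel_size=3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv1d(32, 64, kernel_size=3, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool1d(1),
        )
        self.head = nn.Linear(64, n_expr)

    def forward(self, audio_feats):          # (B, window, feat_dim)
        x = audio_feats.transpose(1, 2)      # (B, feat_dim, window)
        x = self.net(x).squeeze(-1)          # (B, 64)
        return self.head(x)                  # (B, n_expr) expression coefficients

class NeuralRenderer(nn.Module):
    """Stand-in for the neural rendering stage: turns a rasterized image of the
    expression-driven 3D face model into a photo-realistic output frame."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, rasterized_face):      # (B, 3, H, W)
        return self.net(rasterized_face)     # (B, 3, H, W) photo-realistic frame

# Usage sketch: one audio window -> expression coefficients -> rendered frame.
audio_window = torch.randn(1, 16, 29)        # hypothetical per-frame audio features
expr = Audio2Expression()(audio_window)      # latent 3D expression coefficients
raster = torch.rand(1, 3, 256, 256)          # placeholder rasterization of the 3D model
frame = NeuralRenderer()(raster)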
Cite
Text
Thies et al. "Neural Voice Puppetry: Audio-Driven Facial Reenactment." Proceedings of the European Conference on Computer Vision (ECCV), 2020. doi:10.1007/978-3-030-58517-4_42
Markdown
[Thies et al. "Neural Voice Puppetry: Audio-Driven Facial Reenactment." Proceedings of the European Conference on Computer Vision (ECCV), 2020.](https://mlanthology.org/eccv/2020/thies2020eccv-neural/) doi:10.1007/978-3-030-58517-4_42
BibTeX
@inproceedings{thies2020eccv-neural,
title = {{Neural Voice Puppetry: Audio-Driven Facial Reenactment}},
author = {Thies, Justus and Elgharib, Mohamed and Tewari, Ayush and Theobalt, Christian and Nießner, Matthias},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2020},
doi = {10.1007/978-3-030-58517-4_42},
url = {https://mlanthology.org/eccv/2020/thies2020eccv-neural/}
}