All You Need Is Your Voice: Emotional Face Representation with Audio Perspective for Emotional Talking Face Generation
Abstract
With the rise of generative models, multi-modal video generation has gained significant attention, particularly audio-driven emotional talking face synthesis. This paper addresses two key challenges in this domain: input bias and intensity saturation. A novel neutralization scheme is first proposed to counter input bias, yielding impressive results in generating neutral talking faces from emotionally expressive ones. Furthermore, regression learning over 2D continuous emotion labels effectively generates varying emotional intensities on a per-frame basis. A user study quantifying subjective interpretations of emotion strength and naturalness shows up to 78.09% higher emotion accuracy and up to 3.41 higher naturalness scores compared to the lowest-ranked method. Code: https://github.com/sbde500/EAP
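The abstract's frame-wise regression over 2D continuous emotion labels can be illustrated with a minimal sketch. This is not the paper's implementation: the feature dimension, the synthetic data, and the use of plain least squares are all assumptions; the idea shown is only that each frame is mapped to a continuous 2D emotion coordinate (e.g. valence-arousal), so emotional intensity can vary smoothly from frame to frame instead of saturating at a fixed class label.

```python
import numpy as np

# Hypothetical sketch: regress per-frame features onto 2D continuous
# emotion labels (valence, arousal). All dimensions are assumptions.
rng = np.random.default_rng(0)

T, D = 100, 32                       # frames, feature dimension (assumed)
X = rng.normal(size=(T, D))          # synthetic per-frame audio features
W_true = rng.normal(size=(D, 2))     # hidden mapping to (valence, arousal)
Y = X @ W_true + 0.01 * rng.normal(size=(T, 2))  # noisy 2D label per frame

# Least-squares fit: predict a 2D emotion coordinate for every frame.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)
Y_hat = X @ W                        # (T, 2) frame-wise emotion coordinates

# Frame-wise intensity as distance from the neutral origin: a continuous
# value per frame, which is what enables varying emotional intensities.
intensity = np.linalg.norm(Y_hat, axis=1)
print(Y_hat.shape, intensity.shape)  # (100, 2) (100,)
```

Because the labels are continuous rather than categorical, scaling a predicted coordinate toward or away from the origin directly modulates how strong the expressed emotion is on that frame.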
Cite
Text
Kim and Song. "All You Need Is Your Voice: Emotional Face Representation with Audio Perspective for Emotional Talking Face Generation." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-73039-9_20
Markdown
[Kim and Song. "All You Need Is Your Voice: Emotional Face Representation with Audio Perspective for Emotional Talking Face Generation." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/kim2024eccv-all/) doi:10.1007/978-3-031-73039-9_20
BibTeX
@inproceedings{kim2024eccv-all,
title = {{All You Need Is Your Voice: Emotional Face Representation with Audio Perspective for Emotional Talking Face Generation}},
author = {Kim, Seongho and Song, Byung Cheol},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2024},
doi = {10.1007/978-3-031-73039-9_20},
url = {https://mlanthology.org/eccv/2024/kim2024eccv-all/}
}