All You Need Is Your Voice: Emotional Face Representation with Audio Perspective for Emotional Talking Face Generation

Abstract

With the rise of generative models, multi-modal video generation has gained significant attention, particularly in the realm of audio-driven emotional talking face synthesis. This paper addresses two key challenges in this domain: input bias and intensity saturation. A novel neutralization scheme is first proposed to counter input bias, yielding impressive results in generating neutral talking faces from emotionally expressive ones. Furthermore, 2D continuous emotion label-based regression learning effectively generates varying emotional intensities on a frame basis. Results from a user study quantify subjective interpretations of strong emotions and naturalness, revealing up to 78.09% higher emotion accuracy and up to 3.41 higher naturalness score compared to the lowest-ranked method. Code: https://github.com/sbde500/EAP
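To make the "2D continuous emotion label" idea concrete, the sketch below regresses per-frame audio features onto a 2D continuous emotion space (e.g. valence-arousal) and derives a frame-wise intensity from the distance to the neutral origin. All names, features, and the linear model are invented for illustration; the paper's actual architecture and labels may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy per-frame audio features: T frames, each a D-dim vector
# (hypothetical stand-in for real acoustic features).
T, D = 200, 16
X = rng.normal(size=(T, D))

# Synthetic ground-truth mapping to 2D continuous emotion labels
# (valence, arousal), plus a little noise.
W_true = rng.normal(size=(D, 2))
Y = X @ W_true + 0.01 * rng.normal(size=(T, 2))

# Regression learning: fit a linear map from audio features to the
# 2D emotion labels via least squares.
W_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
Y_hat = X @ W_hat  # predicted (valence, arousal) per frame

# Frame-wise emotional intensity: distance from the neutral origin
# (0, 0) in the 2D emotion plane, yielding a continuous value per
# frame rather than a single saturated category.
intensity = np.linalg.norm(Y_hat, axis=1)

print(Y_hat.shape, intensity.shape)
```

The point of the 2D continuous formulation is that intensity falls out naturally as a magnitude in the label space, so each frame can carry a different strength of the same emotion instead of one fixed categorical label per clip.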

Cite

Text

Kim and Song. "All You Need Is Your Voice: Emotional Face Representation with Audio Perspective for Emotional Talking Face Generation." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-73039-9_20

Markdown

[Kim and Song. "All You Need Is Your Voice: Emotional Face Representation with Audio Perspective for Emotional Talking Face Generation." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/kim2024eccv-all/) doi:10.1007/978-3-031-73039-9_20

BibTeX

@inproceedings{kim2024eccv-all,
  title     = {{All You Need Is Your Voice: Emotional Face Representation with Audio Perspective for Emotional Talking Face Generation}},
  author    = {Kim, Seongho and Song, Byung Cheol},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-73039-9_20},
  url       = {https://mlanthology.org/eccv/2024/kim2024eccv-all/}
}