EcoFace: Audio-Visual Emotional Co-Disentanglement Speech-Driven 3D Talking Face Generation

Abstract

Speech-driven 3D facial animation has attracted significant attention due to its wide range of applications in animation production and virtual reality. Recent research has explored speech-emotion disentanglement to enhance facial expressions rather than manually assigning emotions. However, these approaches face issues such as feature confusion, emotion weakening, and the mean-face problem. To address these issues, we present EcoFace, a framework that (1) proposes a novel collaboration objective that provides an explicit signal for emotion representation learning from the speaker's expressive movements and produced sounds, constructing an audio-visual joint and coordinated emotion space that is independent of speech content, and (2) constructs a universal facial motion distribution space determined by speech features and implements speaker-specific generation. Extensive experiments show that our method achieves more generalized and emotionally realistic talking face generation than previous methods.
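
Below is a minimal sketch of the kind of audio-visual emotion alignment objective the abstract describes: two modality-specific encoders projecting speech and facial-motion features into a shared emotion space, trained with a symmetric contrastive loss. The module names, feature dimensions, and the InfoNCE-style formulation are illustrative assumptions, not the authors' actual implementation.

```python
# Illustrative sketch only: aligning audio and visual emotion embeddings
# in a joint space. All names and sizes are assumptions for demonstration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class EmotionProjector(nn.Module):
    """Projects modality-specific features into a shared emotion space."""

    def __init__(self, in_dim: int, emb_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, emb_dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Unit-normalized embeddings so similarity is a cosine score.
        return F.normalize(self.net(x), dim=-1)


def audio_visual_alignment_loss(z_audio, z_visual, temperature: float = 0.07):
    """Symmetric contrastive loss: paired audio/visual emotion embeddings are
    pulled together, embeddings from different clips are pushed apart."""
    logits = z_audio @ z_visual.t() / temperature
    targets = torch.arange(z_audio.size(0), device=z_audio.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    audio_feats = torch.randn(8, 768)   # e.g. pooled speech features per clip
    visual_feats = torch.randn(8, 512)  # e.g. pooled facial-motion features
    audio_enc, visual_enc = EmotionProjector(768), EmotionProjector(512)
    loss = audio_visual_alignment_loss(audio_enc(audio_feats),
                                       visual_enc(visual_feats))
    print(loss.item())
```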

Cite

Text

Xie et al. "EcoFace: Audio-Visual Emotional Co-Disentanglement Speech-Driven 3D Talking Face Generation." International Conference on Learning Representations, 2025.

Markdown

[Xie et al. "EcoFace: Audio-Visual Emotional Co-Disentanglement Speech-Driven 3D Talking Face Generation." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/xie2025iclr-ecoface/)

BibTeX

@inproceedings{xie2025iclr-ecoface,
  title     = {{EcoFace: Audio-Visual Emotional Co-Disentanglement Speech-Driven 3D Talking Face Generation}},
  author    = {Xie, Jiajian and Zhang, Shengyu and Li, Mengze and Lv, Chengfei and Zhao, Zhou and Wu, Fei},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/xie2025iclr-ecoface/}
}