MoEE: Mixture of Emotion Experts for Audio-Driven Portrait Animation

Abstract

The generation of talking avatars has seen significant advances in precise audio synchronization. However, crafting lifelike talking head videos requires capturing a broad spectrum of emotions and subtle facial expressions. Current methods face fundamental challenges: a) the absence of frameworks for modeling single basic emotional expressions, which restricts the generation of complex emotions such as compound emotions; b) the lack of comprehensive datasets rich in human emotional expressions, which limits the potential of models. To address these challenges, we propose the following innovations: 1) the Mixture of Emotion Experts (MoEE) model, which decouples six fundamental emotions to enable the precise synthesis of both singular and compound emotional states; 2) the DH-FaceEmoVid-150 dataset, specifically curated to include six prevalent human emotional expressions as well as four types of compound emotions, thereby expanding the training potential of emotion-driven models; 3) an emotion-to-latents module that leverages multimodal inputs, aligning diverse control signals--such as audio, text, and labels--to enhance audio-driven emotion control. Through extensive quantitative and qualitative evaluations, we demonstrate that the MoEE framework, in conjunction with the DH-FaceEmoVid-150 dataset, excels in generating complex emotional expressions and nuanced facial details, setting a new benchmark in the field. The dataset will be publicly released.
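As a rough illustration of the core idea described in the abstract, the sketch below shows how a mixture-of-emotion-experts layer could blend six per-emotion experts using gate weights derived from an emotion latent (e.g. the output of the emotion-to-latents module). All class names, dimensions, and the softmax gating scheme are assumptions made for illustration; this is not the authors' implementation.

```python
import torch
import torch.nn as nn


class EmotionExpert(nn.Module):
    """One expert: a small MLP assumed to specialize in a single basic emotion."""

    def __init__(self, dim: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class MixtureOfEmotionExperts(nn.Module):
    """Blend expert outputs with gate weights computed from an emotion latent.

    Compound emotions fall out of the gating: a gate that places mass on two
    experts (say, 'happy' and 'surprised') mixes their contributions.
    """

    def __init__(self, dim: int, num_experts: int = 6):
        super().__init__()
        self.experts = nn.ModuleList(EmotionExpert(dim) for _ in range(num_experts))
        self.gate = nn.Linear(dim, num_experts)  # emotion latent -> expert weights

    def forward(self, x: torch.Tensor, emotion_latent: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.gate(emotion_latent), dim=-1)     # (B, E)
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, D)
        return (weights.unsqueeze(-1) * expert_out).sum(dim=1)         # (B, D)


if __name__ == "__main__":
    moee = MixtureOfEmotionExperts(dim=256)
    feats = torch.randn(2, 256)    # audio-driven motion features (assumed shape)
    emo = torch.randn(2, 256)      # latent from a hypothetical emotion-to-latents module
    print(moee(feats, emo).shape)  # torch.Size([2, 256])
```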

Cite

Text

Liu et al. "MoEE: Mixture of Emotion Experts for Audio-Driven Portrait Animation." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.02442

Markdown

[Liu et al. "MoEE: Mixture of Emotion Experts for Audio-Driven Portrait Animation." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/liu2025cvpr-moee/) doi:10.1109/CVPR52734.2025.02442

BibTeX

@inproceedings{liu2025cvpr-moee,
  title     = {{MoEE: Mixture of Emotion Experts for Audio-Driven Portrait Animation}},
  author    = {Liu, Huaize and Sun, Wenzhang and Di, Donglin and Sun, Shibo and Yang, Jiahui and Zou, Changqing and Bao, Hujun},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {26222--26231},
  doi       = {10.1109/CVPR52734.2025.02442},
  url       = {https://mlanthology.org/cvpr/2025/liu2025cvpr-moee/}
}