ProsodyTalker: 3D Visual Speech Animation via Prosody Decomposition

Abstract

Most existing 3D visual speech animation methods synthesize lip movements synchronized with speech but neglect head poses, which degrades animation realism. Animating head poses presents two primary challenges: (1) the intricate mapping between speech and head poses remains poorly understood, and (2) 4D face datasets featuring realistic head poses are lacking. Inspired by prosody decomposition in speech processing, we observe that head movements correlate with the fundamental frequency (F0) of speech prosody, while lip movements align with the language content. These observations motivate us to propose a novel framework, dubbed ProsodyTalker, that concurrently synthesizes lip and head movements based on prosody decomposition. The core idea is to first adopt information perturbation to explicitly decompose the speech prosody into pose-related F0 and lip-related language content. Then, an autoregressive content-oriented fusion decoder is employed to enhance lip synchronization in the synthesized facial sequences. To synthesize head poses, we design a transformer-based variational autoencoder to learn a latent distribution of facial sequences and propose an F0-conditioned latent diffusion model to establish a probabilistic mapping from F0 to pose-related latent codes. Furthermore, we contribute a large-scale 4D face dataset containing wide variations in identities, head poses, and facial motions. Extensive experiments show that our method achieves more realistic animation than state-of-the-art methods.
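As a rough illustration of the F0 signal the abstract refers to, the sketch below estimates the fundamental frequency of a single voiced frame by picking the strongest autocorrelation peak. This is a generic textbook estimator, not the F0 extractor or the information perturbation used in ProsodyTalker, which the abstract does not specify. The function name `estimate_f0`, the search range `fmin`/`fmax`, and the synthetic 200 Hz test tone are all assumptions made for illustration.

```python
import numpy as np


def estimate_f0(frame, sample_rate, fmin=60.0, fmax=400.0):
    """Estimate F0 in Hz for one voiced frame via autocorrelation peak picking.

    Hypothetical helper for illustration only; fmin and fmax are assumed
    bounds on a plausible speaking pitch range.
    """
    # Remove any DC offset so it does not dominate the autocorrelation.
    frame = frame - np.mean(frame)
    # Full autocorrelation; keep the non-negative lags only.
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    # Restrict the search to lags that correspond to [fmin, fmax].
    lag_min = int(sample_rate / fmax)
    lag_max = min(int(sample_rate / fmin), len(acf) - 1)
    best_lag = lag_min + int(np.argmax(acf[lag_min:lag_max + 1]))
    return sample_rate / best_lag


# Synthetic stand-in for a voiced speech frame: a 40 ms, 200 Hz sine tone.
sample_rate = 16000
t = np.arange(int(0.04 * sample_rate)) / sample_rate
frame = np.sin(2 * np.pi * 200.0 * t)
f0 = estimate_f0(frame, sample_rate)
print(f"estimated F0: {f0:.1f} Hz")
```

Applied frame by frame to real speech, an estimator of this kind yields an F0 contour. Real recordings would also need voiced/unvoiced detection and smoothing, which this sketch omits.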

Cite

Text

Li et al. "ProsodyTalker: 3D Visual Speech Animation via Prosody Decomposition." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I5.32542

Markdown

[Li et al. "ProsodyTalker: 3D Visual Speech Animation via Prosody Decomposition." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/li2025aaai-prosodytalker/) doi:10.1609/AAAI.V39I5.32542

BibTeX

@inproceedings{li2025aaai-prosodytalker,
  title     = {{ProsodyTalker: 3D Visual Speech Animation via Prosody Decomposition}},
  author    = {Li, Zonglin and Lv, Xiaoqian and Liu, Qinglin and Meng, Quanling and Sun, Xin and Zhang, Shengping},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {5110-5118},
  doi       = {10.1609/AAAI.V39I5.32542},
  url       = {https://mlanthology.org/aaai/2025/li2025aaai-prosodytalker/}
}