DGTalker: Disentangled Generative Latent Space Learning for Audio-Driven Gaussian Talking Heads

Abstract

In this work, we investigate the generation of high-fidelity, audio-driven 3D Gaussian talking heads from monocular videos. We present DGTalker, an innovative framework for real-time, high-fidelity, and 3D-aware talking head synthesis. By leveraging Gaussian generative priors and framing the task as latent space navigation, our method effectively alleviates the lack of 3D information and the poor detail reconstruction caused by the absence of structural priors in monocular videos, a longstanding challenge for existing 3DGS-based approaches. To ensure precise lip synchronization and nuanced expression control, we propose a disentangled latent space navigation method that models lip motion and talking expressions independently. Additionally, we introduce an effective masked cross-view supervision strategy to enable robust learning within the disentangled framework. Extensive experiments demonstrate that DGTalker surpasses current state-of-the-art methods in visual quality, motion accuracy, and controllability.
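The disentangled latent navigation idea can be pictured with a minimal PyTorch sketch: an audio feature drives two independent heads that predict offsets in separate lip and expression subspaces of a pretrained Gaussian head generator's latent space. All module names, dimensions, and the generator interface below are illustrative assumptions, not the authors' released implementation.

# Hypothetical sketch of disentangled latent-space navigation for
# audio-driven Gaussian talking heads (illustrative, not the paper's code).
import torch
import torch.nn as nn

class LatentNavigator(nn.Module):
    """Predicts offsets in two independent latent subspaces (lip / expression)
    of an assumed pretrained 3D Gaussian head generator, given audio features."""

    def __init__(self, audio_dim=768, lip_dim=64, exp_dim=64, hidden=256):
        super().__init__()
        # Separate heads keep the lip-motion and expression codes disentangled.
        self.lip_head = nn.Sequential(
            nn.Linear(audio_dim, hidden), nn.ReLU(), nn.Linear(hidden, lip_dim)
        )
        self.exp_head = nn.Sequential(
            nn.Linear(audio_dim, hidden), nn.ReLU(), nn.Linear(hidden, exp_dim)
        )

    def forward(self, audio_feat, base_lip, base_exp):
        # Navigate the latent space by offsetting identity-specific base codes.
        lip_code = base_lip + self.lip_head(audio_feat)
        exp_code = base_exp + self.exp_head(audio_feat)
        return lip_code, exp_code

if __name__ == "__main__":
    nav = LatentNavigator()
    audio = torch.randn(2, 768)      # e.g., per-frame audio embeddings
    base_lip = torch.zeros(2, 64)    # identity-specific base codes (assumed)
    base_exp = torch.zeros(2, 64)
    lip, exp = nav(audio, base_lip, base_exp)
    # lip / exp would be fed to a pretrained Gaussian generator (not shown)
    # to render the final talking head.
    print(lip.shape, exp.shape)      # torch.Size([2, 64]) torch.Size([2, 64])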

Cite

Text

Liang et al. "DGTalker: Disentangled Generative Latent Space Learning for Audio-Driven Gaussian Talking Heads." International Conference on Computer Vision, 2025.

Markdown

[Liang et al. "DGTalker: Disentangled Generative Latent Space Learning for Audio-Driven Gaussian Talking Heads." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/liang2025iccv-dgtalker/)

BibTeX

@inproceedings{liang2025iccv-dgtalker,
  title     = {{DGTalker: Disentangled Generative Latent Space Learning for Audio-Driven Gaussian Talking Heads}},
  author    = {Liang, Xiaoxi and Fan, Yanbo and Yang, Qiya and Wang, Xuan and Gao, Wei and Li, Ge},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {11079--11088},
  url       = {https://mlanthology.org/iccv/2025/liang2025iccv-dgtalker/}
}