AnyTalk: Multi-Modal Driven Multi-Domain Talking Head Generation

Abstract

Cross-domain talking head generation, such as animating a static cartoon animal photo with a real human video, is crucial for personalized content creation. However, prior works typically rely on domain-specific frameworks and paired videos, limiting their utility and complicating their architectures with additional motion-alignment modules. Addressing these shortcomings, we propose AnyTalk, a unified framework that eliminates the need for paired data and learns a shared motion representation across different domains. The motion is represented by canonical 3D keypoints extracted with an unsupervised 3D keypoint detector. Further, we propose an expression consistency loss to improve the accuracy of facial dynamics in the generated videos. Additionally, we present AniTalk, a comprehensive dataset designed for advanced multi-modal cross-domain generation. Our experiments demonstrate that AnyTalk excels at generating high-quality, multi-modal talking head videos, showcasing remarkable generalization capabilities across diverse domains.
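The abstract does not spell out the form of the expression consistency loss; the following is a minimal sketch of one plausible formulation, assuming a frozen expression encoder (a hypothetical `expr_encoder`) that maps a frame to an expression embedding, with the loss penalizing the distance between embeddings of the driving and generated frames. It is an illustration of the general idea, not the authors' implementation.

```python
# Hedged sketch: a generic expression consistency loss.
# `expr_encoder` is an assumed, pretrained and frozen network returning
# a (B, D) expression embedding for a batch of (B, 3, H, W) frames.
import torch
import torch.nn.functional as F


def expression_consistency_loss(expr_encoder, driving_frames, generated_frames):
    """L1 distance between expression embeddings of driving and generated frames."""
    with torch.no_grad():
        target = expr_encoder(driving_frames)   # expressions the output should reproduce
    pred = expr_encoder(generated_frames)       # expressions actually rendered
    return F.l1_loss(pred, target)
```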

Cite

Text

Wang et al. "AnyTalk: Multi-Modal Driven Multi-Domain Talking Head Generation." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I8.32874

Markdown

[Wang et al. "AnyTalk: Multi-Modal Driven Multi-Domain Talking Head Generation." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/wang2025aaai-anytalk/) doi:10.1609/AAAI.V39I8.32874

BibTeX

@inproceedings{wang2025aaai-anytalk,
  title     = {{AnyTalk: Multi-Modal Driven Multi-Domain Talking Head Generation}},
  author    = {Wang, Yu and Liu, Yunfei and Hong, Fa-Ting and Cao, Meng and Lin, Lijian and Li, Yu},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {8105--8113},
  doi       = {10.1609/AAAI.V39I8.32874},
  url       = {https://mlanthology.org/aaai/2025/wang2025aaai-anytalk/}
}