Zero-1-to-A: Zero-Shot One Image to Animatable Head Avatars Using Video Diffusion

Zhou, Zhenglin; Ma, Fan; Fan, Hehe; Chua, Tat-Seng

doi:10.1109/CVPR52734.2025.01486

CVPR 2025 pp. 15941-15952


Abstract

Animatable head avatar generation typically requires extensive data for training. To reduce the data requirements, a natural solution is to leverage existing data-free static avatar generation methods, such as pre-trained diffusion models with score distillation sampling (SDS), which align avatars with pseudo ground-truth outputs from the diffusion model. However, directly distilling 4D avatars from video diffusion often leads to over-smoothed results due to spatial and temporal inconsistencies in the generated video. To address this issue, we propose Zero-1-to-A, a robust method that uses the video diffusion model to synthesize a spatially and temporally consistent dataset for 4D avatar reconstruction. Specifically, Zero-1-to-A iteratively constructs video datasets and progressively optimizes animatable avatars, so that avatar quality increases smoothly and consistently throughout the learning process. This progressive learning involves two stages: (1) Spatial Consistency Learning fixes expressions and learns from front to side views, and (2) Temporal Consistency Learning fixes views and learns from relaxed to exaggerated expressions, generating 4D avatars in a simple-to-complex manner. Extensive experiments demonstrate that Zero-1-to-A improves fidelity, animation quality, and rendering speed compared to existing diffusion-based methods, providing a solution for lifelike avatar creation. Code is publicly available at https://github.com/ZhenglinZhou/Zero-1-to-A.
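
The two-stage, simple-to-complex ordering described in the abstract can be illustrated with a small scheduling sketch. This is not the authors' implementation: the `Sample` type, the scalar `yaw_deg` and `expr_intensity` fields, and all function names are hypothetical, and the sketch only shows the ordering of training conditions, not dataset synthesis with video diffusion or avatar optimization.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Sample:
    """One training condition (hypothetical representation)."""
    yaw_deg: float         # 0 = frontal view; larger magnitude = more side-on
    expr_intensity: float  # 0 = relaxed expression; larger = more exaggerated


def spatial_stage(yaws: List[float], fixed_expr: float = 0.0) -> List[Sample]:
    """Stage 1 sketch: hold the expression fixed, order views front to side."""
    return [Sample(y, fixed_expr) for y in sorted(yaws, key=abs)]


def temporal_stage(intensities: List[float], fixed_yaw: float = 0.0) -> List[Sample]:
    """Stage 2 sketch: hold the view fixed, order expressions relaxed to exaggerated."""
    return [Sample(fixed_yaw, e) for e in sorted(intensities)]


def progressive_schedule(yaws: List[float], intensities: List[float]) -> List[Sample]:
    """Concatenate the two stages so conditions go from simple to complex."""
    return spatial_stage(yaws) + temporal_stage(intensities)


if __name__ == "__main__":
    for sample in progressive_schedule([60, -30, 0, 30, -60], [0.8, 0.1, 0.5]):
        print(sample)
```

In this sketch each stage varies exactly one factor while the other stays fixed, which mirrors the abstract's description of fixing expressions while learning views and then fixing views while learning expressions.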


Cite

Text

Zhou et al. "Zero-1-to-A: Zero-Shot One Image to Animatable Head Avatars Using Video Diffusion." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.01486

Markdown

[Zhou et al. "Zero-1-to-A: Zero-Shot One Image to Animatable Head Avatars Using Video Diffusion." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/zhou2025cvpr-zero1toa/) doi:10.1109/CVPR52734.2025.01486

BibTeX

@inproceedings{zhou2025cvpr-zero1toa,
  title     = {{Zero-1-to-A: Zero-Shot One Image to Animatable Head Avatars Using Video Diffusion}},
  author    = {Zhou, Zhenglin and Ma, Fan and Fan, Hehe and Chua, Tat-Seng},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {15941-15952},
  doi       = {10.1109/CVPR52734.2025.01486},
  url       = {https://mlanthology.org/cvpr/2025/zhou2025cvpr-zero1toa/}
}