Zero-1-to-A: Zero-Shot One Image to Animatable Head Avatars Using Video Diffusion
Abstract
Animatable head avatar generation typically requires extensive training data. To reduce the data requirements, a natural solution is to leverage existing data-free static avatar generation methods, such as pre-trained diffusion models with score distillation sampling (SDS), which align avatars with pseudo ground-truth outputs from the diffusion model. However, directly distilling 4D avatars from video diffusion often leads to over-smoothed results due to spatial and temporal inconsistencies in the generated video. To address this issue, we propose Zero-1-to-A, a robust method that uses the video diffusion model to synthesize a spatially and temporally consistent dataset for 4D avatar reconstruction. Specifically, Zero-1-to-A iteratively constructs video datasets and optimizes animatable avatars in a progressive manner, ensuring that avatar quality increases smoothly and consistently throughout the learning process. This progressive learning involves two stages: (1) Spatial Consistency Learning fixes expressions and learns from front-to-side views, and (2) Temporal Consistency Learning fixes views and learns from relaxed to exaggerated expressions, generating 4D avatars in a simple-to-complex manner. Extensive experiments demonstrate that Zero-1-to-A improves fidelity, animation quality, and rendering speed compared to existing diffusion-based methods, providing a solution for lifelike avatar creation. Code is publicly available at: https://github.com/ZhenglinZhou/Zero-1-to-A.
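The simple-to-complex ordering described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The `Sample` type, the yaw-angle parameterization of views, the scalar expression intensity, the fixed neutral expression and frontal view, and the `generate`/`optimize` callables are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Sample:
    """One training condition (hypothetical parameterization)."""
    yaw_deg: float         # camera yaw; 0 is assumed to be the frontal view
    expr_intensity: float  # 0 = relaxed, larger = more exaggerated (assumed scale)


def spatial_stage(yaws: Sequence[float], fixed_expr: float = 0.0) -> List[Sample]:
    """Stage 1: expression held fixed, views ordered from front to side."""
    return [Sample(y, fixed_expr) for y in sorted(yaws, key=abs)]


def temporal_stage(intensities: Sequence[float], fixed_yaw: float = 0.0) -> List[Sample]:
    """Stage 2: view held fixed, expressions ordered from relaxed to exaggerated."""
    return [Sample(fixed_yaw, e) for e in sorted(intensities)]


def progressive_schedule(yaws: Sequence[float], intensities: Sequence[float]) -> List[Sample]:
    """Concatenate both stages into one simple-to-complex curriculum."""
    return spatial_stage(yaws) + temporal_stage(intensities)


def progressive_fit(
    schedule: Sequence[Sample],
    generate: Callable[[Sample], object],
    optimize: Callable[[list], None],
) -> list:
    """Alternate dataset growth and avatar optimization.

    `generate` stands in for producing a pseudo ground-truth clip with a
    video diffusion model; `optimize` stands in for updating the avatar on
    the dataset collected so far. Both are placeholders.
    """
    dataset: list = []
    for sample in schedule:
        dataset.append(generate(sample))
        optimize(dataset)
    return dataset


schedule = progressive_schedule(yaws=[60, 0, -30, 30], intensities=[0.8, 0.2])
sizes: List[int] = []
dataset = progressive_fit(schedule, generate=lambda s: s, optimize=lambda d: sizes.append(len(d)))
```

In this sketch the dataset grows by one condition per step and the avatar is refit after each addition, so harder views and expressions are only introduced once easier ones are already in the dataset.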
Cite
Text
Zhou et al. "Zero-1-to-A: Zero-Shot One Image to Animatable Head Avatars Using Video Diffusion." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.01486
Markdown
[Zhou et al. "Zero-1-to-A: Zero-Shot One Image to Animatable Head Avatars Using Video Diffusion." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/zhou2025cvpr-zero1toa/) doi:10.1109/CVPR52734.2025.01486
BibTeX
@inproceedings{zhou2025cvpr-zero1toa,
title = {{Zero-1-to-A: Zero-Shot One Image to Animatable Head Avatars Using Video Diffusion}},
author = {Zhou, Zhenglin and Ma, Fan and Fan, Hehe and Chua, Tat-Seng},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2025},
pages = {15941--15952},
doi = {10.1109/CVPR52734.2025.01486},
url = {https://mlanthology.org/cvpr/2025/zhou2025cvpr-zero1toa/}
}