GAS: Generative Avatar Synthesis from a Single Image
Abstract
We address the challenging task of single-image avatar generation with a unified and generalizable framework that synthesizes view-consistent and temporally coherent avatars from a single reference image. Existing diffusion-based methods often condition on sparse human templates (e.g., depth or normal maps), and the mismatch between these signals and the subject's true appearance leads to multi-view and temporal inconsistencies. Our approach bridges this gap by combining the reconstruction power of regression-based 3D human reconstruction with the generative capabilities of a diffusion model. First, an initial 3D human reconstructed by a generalized NeRF provides comprehensive conditioning, ensuring high-quality synthesis that is faithful to the reference appearance and structure. The geometry and appearance derived from the generalized NeRF then serve as input to a video-based diffusion model; this integration is key to enforcing both multi-view and temporal consistency throughout the avatar's generation. Empirical results underscore the strong generalization ability of the proposed method, demonstrating its effectiveness across diverse in-domain and out-of-domain in-the-wild datasets.
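The sketch below illustrates the two-stage data flow the abstract describes: a generalized-NeRF-style reconstructor produces coarse appearance and geometry renders from the single reference image, and a video diffusion denoiser is conditioned on those renders. This is a minimal, hypothetical PyTorch-style illustration only; the module names (`GeneralizedReconstructor`, `ConditionalVideoDenoiser`) and their internals are placeholder stand-ins, not the paper's implementation.

```python
# Hypothetical data-flow sketch of a two-stage single-image avatar pipeline:
# (1) a generalized-NeRF-style reconstructor renders coarse appearance and
#     geometry (normals) for the target frames, and
# (2) a video diffusion denoiser is conditioned on those renders.
# All module internals are placeholders, not the authors' architecture.
import torch
import torch.nn as nn


class GeneralizedReconstructor(nn.Module):
    """Placeholder for a generalizable NeRF: one reference image in,
    per-frame coarse RGB and normal renders out (stand-in conv layers)."""

    def __init__(self, feat_dim: int = 32):
        super().__init__()
        self.encoder = nn.Conv2d(3, feat_dim, 3, padding=1)
        self.rgb_head = nn.Conv2d(feat_dim, 3, 3, padding=1)
        self.normal_head = nn.Conv2d(feat_dim, 3, 3, padding=1)

    def forward(self, ref_image: torch.Tensor, num_frames: int):
        # ref_image: (B, 3, H, W). A real model would also take target
        # cameras/poses; here the renders are simply broadcast over frames.
        feat = torch.relu(self.encoder(ref_image))
        rgb = self.rgb_head(feat).unsqueeze(1).repeat(1, num_frames, 1, 1, 1)
        normals = self.normal_head(feat).unsqueeze(1).repeat(1, num_frames, 1, 1, 1)
        return rgb, normals  # coarse appearance + geometry conditions


class ConditionalVideoDenoiser(nn.Module):
    """Placeholder video-diffusion denoiser predicting noise for a frame
    stack, conditioned on the reconstructor's renders via channel concat."""

    def __init__(self, feat_dim: int = 64):
        super().__init__()
        # 9 input channels: noisy RGB (3) + coarse RGB (3) + normals (3).
        self.net = nn.Sequential(
            nn.Conv3d(9, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv3d(feat_dim, 3, 3, padding=1),
        )

    def forward(self, noisy_video, cond_rgb, cond_normals):
        # All inputs: (B, T, 3, H, W) -> concatenate on channels, run 3D conv.
        x = torch.cat([noisy_video, cond_rgb, cond_normals], dim=2)
        return self.net(x.transpose(1, 2)).transpose(1, 2)


if __name__ == "__main__":
    B, T, H, W = 1, 4, 64, 64
    ref = torch.rand(B, 3, H, W)        # single reference image
    noisy = torch.randn(B, T, 3, H, W)  # noisy target frames at some timestep

    reconstructor = GeneralizedReconstructor()
    denoiser = ConditionalVideoDenoiser()

    cond_rgb, cond_normals = reconstructor(ref, num_frames=T)
    noise_pred = denoiser(noisy, cond_rgb, cond_normals)
    print(noise_pred.shape)  # torch.Size([1, 4, 3, 64, 64])
```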
Cite
Text
Lu et al. "GAS: Generative Avatar Synthesis from a Single Image." International Conference on Computer Vision, 2025.
Markdown
[Lu et al. "GAS: Generative Avatar Synthesis from a Single Image." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/lu2025iccv-gas/)
BibTeX
@inproceedings{lu2025iccv-gas,
  title     = {{GAS: Generative Avatar Synthesis from a Single Image}},
  author    = {Lu, Yixing and Dong, Junting and Kwon, Youngjoong and Zhao, Qin and Dai, Bo and De la Torre, Fernando},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {12883--12893},
  url       = {https://mlanthology.org/iccv/2025/lu2025iccv-gas/}
}