NormalCrafter: Learning Temporally Consistent Normals from Video Diffusion Priors
Abstract
Surface normal estimation serves as a cornerstone for a wide spectrum of computer vision applications. While numerous efforts have been devoted to static image scenarios, ensuring temporal coherence in video-based normal estimation remains a formidable challenge. Instead of merely augmenting existing methods with temporal components, we present NormalCrafter to leverage the inherent temporal priors of Video Diffusion Models (VDMs). We identify why directly applying VDMs yields blurry predictions and introduce Semantic Feature Regularization (SFR), which aligns diffusion features with fine-grained semantic cues so that the model concentrates on geometric details. Moreover, we introduce a two-stage training protocol that leverages both latent- and pixel-space learning to preserve spatial accuracy while maintaining a long temporal context. Extensive evaluations demonstrate the efficacy of our method, showcasing superior performance in generating temporally consistent normal sequences with intricate details from diverse videos. Code and models are publicly available at https://normalcrafter.github.io/.
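To make the Semantic Feature Regularization idea concrete, the sketch below shows one plausible form of such a term: intermediate diffusion features are projected and compared against features from a frozen semantic encoder (e.g., a DINO-style backbone) with a cosine-similarity loss. This is a minimal illustration under assumed names, shapes, and loss weighting, not the authors' implementation.

```python
# Hypothetical sketch of a semantic feature regularization term (not the
# official NormalCrafter code). Assumes PyTorch and illustrative tensor shapes.
import torch
import torch.nn.functional as F


def semantic_feature_regularization(diff_feats, sem_feats, proj):
    """Align diffusion features with features from a frozen semantic encoder.

    diff_feats: (B, C_d, H, W) features from a chosen VDM layer (assumed).
    sem_feats:  (B, C_s, H2, W2) features from a frozen semantic encoder (assumed).
    proj:       learnable 1x1 conv mapping C_d -> C_s (assumed).
    """
    # Project diffusion features into the semantic feature space.
    x = proj(diff_feats)                                            # (B, C_s, H, W)
    # Resize to the semantic feature resolution before comparison.
    x = F.interpolate(x, size=sem_feats.shape[-2:], mode="bilinear",
                      align_corners=False)
    # Negative cosine similarity, averaged over batch and spatial locations.
    cos = F.cosine_similarity(x, sem_feats, dim=1)                  # (B, H2, W2)
    return (1.0 - cos).mean()


# Usage sketch: total = diffusion_loss + lambda_sfr * semantic_feature_regularization(...)
```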
Cite
Text
Bin et al. "NormalCrafter: Learning Temporally Consistent Normals from Video Diffusion Priors." International Conference on Computer Vision, 2025.
Markdown
[Bin et al. "NormalCrafter: Learning Temporally Consistent Normals from Video Diffusion Priors." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/bin2025iccv-normalcrafter/)
BibTeX
@inproceedings{bin2025iccv-normalcrafter,
  title     = {{NormalCrafter: Learning Temporally Consistent Normals from Video Diffusion Priors}},
  author    = {Bin, Yanrui and Hu, Wenbo and Wang, Haoyuan and Chen, Xinya and Wang, Bing},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {8330-8339},
  url       = {https://mlanthology.org/iccv/2025/bin2025iccv-normalcrafter/}
}