Buffer Anytime: Zero-Shot Video Depth and Normal from Image Priors
Abstract
We present Buffer Anytime, a framework for estimating depth and normal maps (which we call geometric buffers) from video that eliminates the need for paired video--depth and video--normal training data. Instead of relying on large-scale annotated video datasets, we demonstrate high-quality video buffer estimation by leveraging single-image priors together with temporal consistency constraints. Our zero-shot training strategy combines supervision from state-of-the-art image estimation models with optical flow smoothness constraints through a hybrid loss function, implemented via a lightweight temporal attention architecture. Applied to leading image models such as Depth Anything V2 and Marigold-E2E-FT, our approach significantly improves temporal consistency while maintaining accuracy. Experiments show that our method not only outperforms image-based approaches but also achieves results comparable to state-of-the-art video models trained on large-scale paired video datasets, despite using no such paired video data.
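To make the training objective concrete, the sketch below shows one way such a hybrid loss could be written in PyTorch: a per-frame term anchoring the video model's predictions to a frozen single-image prior, plus an optical-flow-warped temporal smoothness term. This is a minimal illustration under assumed tensor shapes and names (`warp_with_flow`, `hybrid_loss`, the loss weight `w_temporal`); it is not taken from the paper's implementation.

```python
# Illustrative sketch only (not the authors' code): hybrid loss combining a
# frozen single-image prior with optical-flow-based temporal smoothness.
import torch
import torch.nn.functional as F

def warp_with_flow(buffer_next, flow):
    """Backward-warp frame t+1 predictions into frame t using flow (dx, dy)."""
    B, _, H, W = buffer_next.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, device=buffer_next.device, dtype=torch.float32),
        torch.arange(W, device=buffer_next.device, dtype=torch.float32),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=-1).unsqueeze(0) + flow.permute(0, 2, 3, 1)
    # Normalize pixel coordinates to [-1, 1] for grid_sample.
    grid_x = 2.0 * grid[..., 0] / (W - 1) - 1.0
    grid_y = 2.0 * grid[..., 1] / (H - 1) - 1.0
    return F.grid_sample(buffer_next, torch.stack((grid_x, grid_y), dim=-1),
                         align_corners=True)

def hybrid_loss(video_pred, image_prior_pred, flows, valid_masks, w_temporal=1.0):
    """video_pred:       (B, T, C, H, W) buffers from the video model
       image_prior_pred: (B, T, C, H, W) buffers from the frozen image model
       flows:            (B, T-1, 2, H, W) optical flow from frame t to t+1
       valid_masks:      (B, T-1, 1, H, W) occlusion/validity masks
    """
    # Per-frame fidelity to the single-image prior (no video ground truth needed).
    prior_loss = F.l1_loss(video_pred, image_prior_pred)

    # Temporal smoothness: frame t should agree with the flow-warped frame t+1
    # wherever the flow is valid (non-occluded).
    temporal_loss = video_pred.new_zeros(())
    T = video_pred.shape[1]
    for t in range(T - 1):
        warped = warp_with_flow(video_pred[:, t + 1], flows[:, t])
        diff = (video_pred[:, t] - warped).abs() * valid_masks[:, t]
        temporal_loss = temporal_loss + diff.mean()
    temporal_loss = temporal_loss / max(T - 1, 1)

    return prior_loss + w_temporal * temporal_loss
```

In this sketch only the lightweight temporal layers of the video model would be trained; the image prior and the optical flow estimator are treated as frozen supervision signals.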
Cite
Text
Kuang et al. "Buffer Anytime: Zero-Shot Video Depth and Normal from Image Priors." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.01645
Markdown
[Kuang et al. "Buffer Anytime: Zero-Shot Video Depth and Normal from Image Priors." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/kuang2025cvpr-buffer/) doi:10.1109/CVPR52734.2025.01645
BibTeX
@inproceedings{kuang2025cvpr-buffer,
title = {{Buffer Anytime: Zero-Shot Video Depth and Normal from Image Priors}},
author = {Kuang, Zhengfei and Zhang, Tianyuan and Zhang, Kai and Tan, Hao and Bi, Sai and Hu, Yiwei and Xu, Zexiang and Hasan, Milos and Wetzstein, Gordon and Luan, Fujun},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2025},
pages = {17660--17670},
doi = {10.1109/CVPR52734.2025.01645},
url = {https://mlanthology.org/cvpr/2025/kuang2025cvpr-buffer/}
}