GeometryCrafter: Consistent Geometry Estimation for Open-World Videos with Diffusion Priors

Abstract

Despite remarkable advancements in video depth estimation, existing methods fall short in geometric fidelity due to their affine-invariant predictions, restricting their applicability in reconstruction and other metrically grounded downstream tasks. We propose a novel point map Variational Autoencoder (VAE) for encoding and decoding unbounded point maps. Notably, its latent space is agnostic to the latent distributions of video diffusion models, allowing us to leverage their generative priors to model the distribution of point map sequences conditioned on the input videos. Thus, we can recover high-fidelity, temporally coherent point map sequences from open-world videos, facilitating accurate 3D/4D reconstruction, camera parameter estimation, and other depth-based applications. Extensive evaluations on diverse datasets demonstrate that our method achieves state-of-the-art 3D accuracy, temporal consistency, and generalization capability.
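
To make the pipeline described in the abstract concrete, below is a minimal, illustrative sketch of how a point map VAE and a video-conditioned diffusion sampler could fit together at inference time. All module names, tensor shapes, the 8x stride, and the placeholder denoiser are assumptions for illustration only and do not reflect the authors' actual GeometryCrafter implementation.

import torch
import torch.nn as nn

class PointMapVAE(nn.Module):
    # Toy stand-in for the point map VAE: compresses per-frame XYZ point
    # maps into a latent grid and reconstructs them. Layers are hypothetical.
    def __init__(self, latent_dim=4):
        super().__init__()
        self.encoder = nn.Conv2d(3, latent_dim, kernel_size=8, stride=8)
        self.decoder = nn.ConvTranspose2d(latent_dim, 3, kernel_size=8, stride=8)

    def encode(self, point_maps):   # (B*T, 3, H, W) -> (B*T, C, H/8, W/8)
        return self.encoder(point_maps)

    def decode(self, latents):      # (B*T, C, H/8, W/8) -> (B*T, 3, H, W)
        return self.decoder(latents)

def estimate_point_maps(video, vae, denoiser, num_steps=25):
    # video: (B, T, 3, H, W) RGB frames in [0, 1].
    # Returns a temporally coherent point map sequence of the same spatial size.
    b, t, _, h, w = video.shape
    frames = video.flatten(0, 1)
    # Start from Gaussian noise in the point-map latent space and iteratively
    # denoise it, conditioning every step on the input video frames.
    latents = torch.randn(b * t, 4, h // 8, w // 8)
    for step in range(num_steps):
        latents = denoiser(latents, frames, step)
    return vae.decode(latents).view(b, t, 3, h, w)

if __name__ == "__main__":
    vae = PointMapVAE()
    # Placeholder denoiser; a real system would use a pre-trained video
    # diffusion backbone whose generative priors are reused here.
    denoiser = lambda z, cond, step: z * 0.99
    video = torch.rand(1, 4, 3, 64, 64)
    print(estimate_point_maps(video, vae, denoiser).shape)  # torch.Size([1, 4, 3, 64, 64])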

Cite

Text

Xu et al. "GeometryCrafter: Consistent Geometry Estimation for Open-World Videos with Diffusion Priors." International Conference on Computer Vision, 2025.

Markdown

[Xu et al. "GeometryCrafter: Consistent Geometry Estimation for Open-World Videos with Diffusion Priors." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/xu2025iccv-geometrycrafter/)

BibTeX

@inproceedings{xu2025iccv-geometrycrafter,
  title     = {{GeometryCrafter: Consistent Geometry Estimation for Open-World Videos with Diffusion Priors}},
  author    = {Xu, Tian-Xing and Gao, Xiangjun and Hu, Wenbo and Li, Xiaoyu and Zhang, Song-Hai and Shan, Ying},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {6632--6644},
  url       = {https://mlanthology.org/iccv/2025/xu2025iccv-geometrycrafter/}
}