Geo4D: Leveraging Video Generators for Geometric 4D Scene Reconstruction

Abstract

We introduce Geo4D, a method to repurpose video diffusion models for monocular 3D reconstruction of dynamic scenes. By leveraging the strong dynamic priors captured by large-scale pre-trained video models, Geo4D can be trained using only synthetic data while generalizing well to real data in a zero-shot manner. Geo4D predicts several complementary geometric modalities, namely point, disparity, and ray maps. We propose a new multi-modal alignment algorithm to align and fuse these modalities, as well as a sliding window approach at inference time, thus enabling robust and accurate 4D reconstruction of long videos. Extensive experiments across multiple benchmarks show that Geo4D significantly surpasses state-of-the-art video depth estimation methods.
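The sliding window idea mentioned in the abstract can be illustrated with a minimal sketch of how a long video might be split into overlapping frame windows. This is only an illustration under stated assumptions: the function name `sliding_windows` and the `window` and `stride` parameters are hypothetical and are not taken from the paper, which may choose and align its windows differently.

```python
def sliding_windows(num_frames, window, stride):
    """Return (start, end) frame index pairs, end exclusive, covering every frame.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if stride > window:
        raise ValueError("a stride larger than the window would skip frames")
    if num_frames <= 0:
        return []
    if num_frames <= window:
        # A short clip fits in a single (possibly shorter) window.
        return [(0, num_frames)]
    starts = list(range(0, num_frames - window + 1, stride))
    # If the regular grid stops short of the last frame, add one final
    # window that ends exactly at the last frame.
    if starts[-1] + window < num_frames:
        starts.append(num_frames - window)
    return [(s, s + window) for s in starts]


if __name__ == "__main__":
    for start, end in sliding_windows(11, window=4, stride=3):
        print(start, end)
```

Consecutive windows share frames whenever `stride` is smaller than `window`, and those shared frames are what a later alignment step could use to bring per-window predictions into a common frame of reference.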

Cite

Text

Jiang et al. "Geo4D: Leveraging Video Generators for Geometric 4D Scene Reconstruction." International Conference on Computer Vision, 2025.

Markdown

[Jiang et al. "Geo4D: Leveraging Video Generators for Geometric 4D Scene Reconstruction." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/jiang2025iccv-geo4d/)

BibTeX

@inproceedings{jiang2025iccv-geo4d,
  title     = {{Geo4D: Leveraging Video Generators for Geometric 4D Scene Reconstruction}},
  author    = {Jiang, Zeren and Zheng, Chuanxia and Laina, Iro and Larlus, Diane and Vedaldi, Andrea},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {20658--20671},
  url       = {https://mlanthology.org/iccv/2025/jiang2025iccv-geo4d/}
}