StableDepth: Scene-Consistent and Scale-Invariant Monocular Depth

Abstract

Recent advances in monocular depth estimation have significantly improved robustness and accuracy. However, relative depth models still exhibit flickering and 3D inconsistency on video data, limiting their use in 3D reconstruction. We introduce StableDepth, a scene-consistent and scale-invariant depth estimation method that achieves scene-level 3D consistency. Its dual-decoder architecture learns from large-scale unlabeled video data, improving generalization and reducing flickering. Unlike previous methods that require full video sequences, StableDepth supports online inference while running 13× faster, delivering significant gains across benchmarks with temporal consistency comparable to video diffusion-based estimators.
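
To make "scale-invariant" concrete, below is a minimal sketch of the standard scale-invariant log-depth loss in the spirit of Eigen et al. (2014), which such methods commonly build on. This is an illustrative assumption, not StableDepth's exact training objective; the function name and the `lam` weighting are hypothetical.

```python
import torch

def silog_loss(pred_depth: torch.Tensor, gt_depth: torch.Tensor,
               lam: float = 0.5, eps: float = 1e-6) -> torch.Tensor:
    """Scale-invariant log-depth loss (Eigen et al., 2014) -- illustrative,
    not StableDepth's published objective.

    With lam = 1 the loss is invariant to a global scale factor on
    pred_depth, since scaling shifts all log-differences by a constant
    that the mean-subtraction term cancels.
    """
    valid = gt_depth > eps  # ignore pixels without valid ground truth
    d = torch.log(pred_depth[valid] + eps) - torch.log(gt_depth[valid] + eps)
    return (d ** 2).mean() - lam * d.mean() ** 2
```

A typical usage would compare a predicted depth map against ground truth, e.g. `silog_loss(model(image), depth_gt)`; setting `lam` between 0 and 1 trades off absolute-scale accuracy against pure scale invariance.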

Cite

Text

Zhang et al. "StableDepth: Scene-Consistent and Scale-Invariant Monocular Depth." International Conference on Computer Vision, 2025.

Markdown

[Zhang et al. "StableDepth: Scene-Consistent and Scale-Invariant Monocular Depth." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/zhang2025iccv-stabledepth/)

BibTeX

@inproceedings{zhang2025iccv-stabledepth,
  title     = {{StableDepth: Scene-Consistent and Scale-Invariant Monocular Depth}},
  author    = {Zhang, Zheng and Yang, Lihe and Yang, Tianyu and Yu, Chaohui and Guo, Xiaoyang and Lao, Yixing and Zhao, Hengshuang},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {7069--7078},
  url       = {https://mlanthology.org/iccv/2025/zhang2025iccv-stabledepth/}
}