Instance-Level Video Depth in Groups Beyond Occlusions

Abstract

Depth estimation in dynamic, multi-object scenes remains a major challenge, especially under severe occlusions. Existing monocular models, including foundation models, struggle with instance-wise depth consistency due to their reliance on global regression. We tackle this problem on two fronts: data and methodology. First, we introduce the Group Instance Depth (GID) dataset, the first large-scale video depth dataset with instance-level annotations, featuring 101,500 frames from real-world activity scenes. GID bridges the gap between synthetic and real-world depth data by providing high-fidelity depth supervision for multi-object interactions. Second, we propose InstanceDepth, the first occlusion-aware depth estimation framework for multi-object environments. Our two-stage pipeline consists of (1) Holistic Depth Initialization, which establishes a coarse scene-level depth structure, and (2) Instance-Aware Depth Rectification, which refines instance-wise depth using object masks, shape priors, and spatial relationships. By enforcing geometric consistency across occlusions, our method sets a new state of the art on the GID dataset and multiple benchmarks.
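
The abstract's two-stage pipeline can be pictured with a minimal sketch. Everything below is hypothetical: the function names (holistic_depth_init, instance_aware_rectification), the median-based re-anchoring, and the scalar "prior depth" stand in for the paper's learned components, shape priors, and spatial relationships, which are not specified here. The sketch only illustrates the control flow of coarse initialization followed by per-instance, occlusion-consistent refinement.

# Hypothetical sketch of the two-stage pipeline; not the authors' implementation.
import numpy as np

def holistic_depth_init(rgb_frame: np.ndarray) -> np.ndarray:
    """Stage 1 (stand-in): produce a coarse scene-level depth map.

    In the paper this is a learned predictor; here we fake it with a
    vertical depth gradient so the sketch stays self-contained.
    """
    h, w = rgb_frame.shape[:2]
    return np.tile(np.linspace(1.0, 10.0, h)[:, None], (1, w))

def instance_aware_rectification(depth: np.ndarray,
                                 masks: list[np.ndarray],
                                 prior_depths: list[float]) -> np.ndarray:
    """Stage 2 (stand-in): refine depth instance by instance.

    Each instance is shifted so its median depth matches a prior depth
    (a crude proxy for shape priors and spatial relationships), while
    its internal relative structure is preserved.
    """
    refined = depth.copy()
    for mask, prior in zip(masks, prior_depths):
        inst = depth[mask]
        if inst.size == 0:
            continue
        # Re-anchor the instance to its prior; within-instance geometry
        # is kept intact, enforcing a simple cross-instance consistency.
        refined[mask] = inst + (prior - np.median(inst))
    return refined

if __name__ == "__main__":
    frame = np.zeros((64, 64, 3), dtype=np.uint8)   # dummy RGB frame
    coarse = holistic_depth_init(frame)

    # Two toy instance masks: "front" partially occludes "back".
    front = np.zeros((64, 64), dtype=bool); front[20:50, 10:30] = True
    back = np.zeros((64, 64), dtype=bool);  back[25:55, 25:55] = True
    back &= ~front                                   # drop the occluded pixels

    refined = instance_aware_rectification(coarse, [front, back], [3.0, 6.0])
    # The occluder should end up closer to the camera than the occluded object.
    assert np.median(refined[front]) < np.median(refined[back])
    print("median depths:", np.median(refined[front]), np.median(refined[back]))

The toy example only checks one ordering constraint across an occlusion boundary; the actual method supervises such relationships densely with GID's instance-level annotations.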

Cite

Text

Liang et al. "Instance-Level Video Depth in Groups Beyond Occlusions." International Conference on Computer Vision, 2025.

Markdown

[Liang et al. "Instance-Level Video Depth in Groups Beyond Occlusions." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/liang2025iccv-instancelevel/)

BibTeX

@inproceedings{liang2025iccv-instancelevel,
  title     = {{Instance-Level Video Depth in Groups Beyond Occlusions}},
  author    = {Liang, Yuan and Zhou, Yang and Sun, Ziming and Xiang, Tianyi and Li, Guiqing and He, Shengfeng},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {7581--7591},
  url       = {https://mlanthology.org/iccv/2025/liang2025iccv-instancelevel/}
}