FutureDepth: Learning to Predict the Future Improves Video Depth Estimation
Abstract
In this paper, we propose FutureDepth, a novel video depth estimation approach that enables the model to implicitly leverage multi-frame and motion cues to improve depth estimation by learning to predict the future during training. More specifically, we propose a future prediction network, F-Net, which takes the features of multiple consecutive frames and is trained to predict multi-frame features one time step ahead iteratively. In this way, F-Net learns the underlying motion and correspondence information, and we incorporate its features into the depth decoding process. Additionally, to enrich the learning of multi-frame correspondence cues, we further leverage a reconstruction network, R-Net, which is trained via adaptively masked auto-encoding of multi-frame feature volumes. At inference time, both F-Net and R-Net are used to produce queries that work with the depth decoder, as well as a final refinement network. Through extensive experiments on several benchmarks, i.e., NYUDv2, KITTI, DDAD, and Sintel, which cover indoor, driving, and open-domain scenarios, we show that FutureDepth significantly improves upon baseline models, outperforms existing video depth estimation methods, and sets new state-of-the-art (SOTA) accuracy. Furthermore, FutureDepth is more efficient than existing SOTA video depth estimation models and has latency comparable to monocular models.
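The two auxiliary objectives described above can be illustrated with a minimal sketch. This is not the paper's implementation: the predictor and reconstructor below are placeholders, the mask here is random rather than the paper's adaptive masking, and all shapes and names are assumptions chosen only to show the shape of the training signals (iterative one-step-ahead feature prediction for F-Net, masked feature reconstruction for R-Net).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical multi-frame feature volume: T frames, C channels, H x W spatial.
T, C, H, W = 4, 8, 16, 16
feats = rng.standard_normal((T, C, H, W))

def future_prediction_loss(feats, predictor):
    """F-Net-style objective (sketch): from the features of frames 0..t,
    predict the features of frame t+1 and penalize the L1 error,
    iterating one time step ahead over the clip."""
    losses = []
    for t in range(feats.shape[0] - 1):
        pred = predictor(feats[: t + 1])          # predicted features for frame t+1
        losses.append(np.abs(pred - feats[t + 1]).mean())
    return float(np.mean(losses))

def masked_reconstruction_loss(feats, reconstructor, mask_ratio=0.5):
    """R-Net-style objective (sketch): mask spatial locations of the
    multi-frame feature volume and reconstruct them (masked auto-encoding).
    The paper's masking is adaptive; a random mask is used here for brevity."""
    mask = rng.random((feats.shape[0], 1, H, W)) < mask_ratio   # True = masked
    recon = reconstructor(np.where(mask, 0.0, feats))
    return float(np.abs((recon - feats) * mask).mean())

# Placeholder "networks" (copy-last-frame, identity) just to exercise the losses;
# in practice both would be learned modules whose features feed the depth decoder.
fp = future_prediction_loss(feats, predictor=lambda x: x[-1])
mr = masked_reconstruction_loss(feats, reconstructor=lambda x: x)
```

Both losses are plain L1 penalties, so any trainable predictor/reconstructor could be dropped in where the lambdas stand.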
Cite
Text
Yasarla et al. "FutureDepth: Learning to Predict the Future Improves Video Depth Estimation." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-72652-1_26
Markdown
[Yasarla et al. "FutureDepth: Learning to Predict the Future Improves Video Depth Estimation." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/yasarla2024eccv-futuredepth/) doi:10.1007/978-3-031-72652-1_26
BibTeX
@inproceedings{yasarla2024eccv-futuredepth,
title = {{FutureDepth: Learning to Predict the Future Improves Video Depth Estimation}},
author = {Yasarla, Rajeev and Singh, Manish Kumar and Cai, Hong and Shi, Yunxiao and Jeong, Jisoo and Zhu, Yinhao and Han, Shizhong and Garrepalli, Risheek and Porikli, Fatih},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2024},
doi = {10.1007/978-3-031-72652-1_26},
url = {https://mlanthology.org/eccv/2024/yasarla2024eccv-futuredepth/}
}