Revisit Self-Supervision with Local Structure-from-Motion
Abstract
Both self-supervised depth estimation and Structure-from-Motion (SfM) recover scene depth from RGB videos. Despite sharing a similar objective, the two approaches are disconnected. Prior self-supervised methods backpropagate losses defined over immediately neighboring frames. Instead of learning-through-loss, this work proposes an alternative scheme that performs local SfM. First, with calibrated RGB or RGB-D images, we employ a depth and correspondence estimator to infer depthmaps and pair-wise correspondence maps. Then, a novel bundle-RANSAC-adjustment algorithm jointly optimizes camera poses and one depth adjustment for each depthmap. Finally, we fix camera poses and employ a NeRF, albeit without a neural network, for dense triangulation and geometric verification. Poses, depth adjustments, and triangulated sparse depths are our outputs. For the first time, we show that self-supervision within 5 frames already benefits SoTA supervised depth and correspondence models. Despite being self-supervised, our pose algorithm has certified global optimality, outperforming optimization-based, learning-based, and NeRF-based prior arts. The project page is available at the link.
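To make the "one depth adjustment per depthmap" idea concrete, here is a minimal, hypothetical sketch (function names and the median-ratio estimator are ours, not the paper's bundle-RANSAC-adjustment): each predicted depthmap is refined by a single scalar so that corresponding pixels across frames agree in depth.

```python
import numpy as np

def estimate_depth_scale(depth_ref, depth_src, corr):
    """Estimate one scalar adjustment for depth_src so it agrees with
    depth_ref at corresponding pixels. corr maps (y, x) in the reference
    frame to (y, x) in the source frame. A robust median ratio stands in
    for the paper's joint bundle-RANSAC-adjustment optimization."""
    ratios = [depth_ref[yr, xr] / depth_src[ys, xs]
              for (yr, xr), (ys, xs) in corr]
    return float(np.median(ratios))

# Toy example: the source depthmap is off by a global factor of 2.
depth_ref = np.full((4, 4), 3.0)
depth_src = depth_ref / 2.0
corr = [((y, x), (y, x)) for y in range(4) for x in range(4)]
scale = estimate_depth_scale(depth_ref, depth_src, corr)
print(scale)  # 2.0
```

In the actual method, this per-depthmap adjustment is optimized jointly with camera poses rather than fitted in isolation; the sketch only illustrates the parameterization (a single scalar per depthmap, not a per-pixel refinement).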
Cite
Text
Zhu and Liu. "Revisit Self-Supervision with Local Structure-from-Motion." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-73007-8_3
Markdown
[Zhu and Liu. "Revisit Self-Supervision with Local Structure-from-Motion." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/zhu2024eccv-revisit/) doi:10.1007/978-3-031-73007-8_3
BibTeX
@inproceedings{zhu2024eccv-revisit,
title = {{Revisit Self-Supervision with Local Structure-from-Motion}},
author = {Zhu, Shengjie and Liu, Xiaoming},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2024},
doi = {10.1007/978-3-031-73007-8_3},
url = {https://mlanthology.org/eccv/2024/zhu2024eccv-revisit/}
}