Learning to Recover 3D Scene Shape from a Single Image

Abstract

Despite significant progress in monocular depth estimation in the wild, recent state-of-the-art methods cannot be used to recover accurate 3D scene shape due to an unknown depth shift induced by shift-invariant reconstruction losses used in mixed-data depth prediction training, and possibly an unknown camera focal length. We investigate this problem in detail and propose a two-stage framework that first predicts depth up to an unknown scale and shift from a single monocular image, and then uses 3D point cloud encoders to predict the missing depth shift and focal length, which allow us to recover a realistic 3D scene shape. In addition, we propose an image-level normalized regression loss and a normal-based geometry loss to enhance depth prediction models trained on mixed datasets. We test our depth model on nine unseen datasets and achieve state-of-the-art performance on zero-shot dataset generalization. Code is available at: https://git.io/Depth.
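The affine ambiguity the abstract describes means a predicted depth map is only correct up to an unknown scale and shift, i.e. metric depth ≈ s · pred + t. The following is a minimal sketch (not the paper's implementation; the function name and setup are illustrative) of recovering that scale and shift by closed-form least squares, given a reference depth map:

```python
import numpy as np

def align_scale_shift(pred, gt):
    """Solve min_{s, t} || s * pred + t - gt ||^2 in closed form.

    pred : affine-invariant depth prediction (unknown scale/shift)
    gt   : reference (e.g. metric) depth of the same pixels
    Returns the scale s and shift t that best align pred to gt.
    """
    # Build the linear system [pred, 1] @ [s, t]^T = gt and solve by least squares.
    A = np.stack([pred.ravel(), np.ones(pred.size)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, gt.ravel(), rcond=None)
    return s, t

# Example: a prediction that is off by scale 2 and shift 0.5.
gt = np.array([1.0, 2.0, 3.0, 4.0])
pred = (gt - 0.5) / 2.0
s, t = align_scale_shift(pred, gt)
# s * pred + t now recovers gt up to numerical precision.
```

In the paper's setting no ground-truth depth is available at test time, which is why the authors instead train point cloud encoders to predict the missing shift (and focal length) directly from the unprojected 3D points.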

Cite

Text

Yin et al. "Learning to Recover 3D Scene Shape from a Single Image." Conference on Computer Vision and Pattern Recognition, 2021. doi:10.1109/CVPR46437.2021.00027

Markdown

[Yin et al. "Learning to Recover 3D Scene Shape from a Single Image." Conference on Computer Vision and Pattern Recognition, 2021.](https://mlanthology.org/cvpr/2021/yin2021cvpr-learning-a/) doi:10.1109/CVPR46437.2021.00027

BibTeX

@inproceedings{yin2021cvpr-learning-a,
  title     = {{Learning to Recover 3D Scene Shape from a Single Image}},
  author    = {Yin, Wei and Zhang, Jianming and Wang, Oliver and Niklaus, Simon and Mai, Long and Chen, Simon and Shen, Chunhua},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2021},
  pages     = {204-213},
  doi       = {10.1109/CVPR46437.2021.00027},
  url       = {https://mlanthology.org/cvpr/2021/yin2021cvpr-learning-a/}
}