Self-Supervised Surround-View Depth Estimation with Volumetric Feature Fusion

Abstract

We present a self-supervised depth estimation approach using unified volumetric feature fusion for surround-view images. Given a set of surround-view images, our method constructs a volumetric feature map by extracting image feature maps from the surround-view images and fusing them into a shared, unified 3D voxel space. The volumetric feature map can then be used to estimate a depth map for each surround view by projecting it into that view's image coordinates. Because each volumetric feature encodes 3D information at its local voxel coordinate, our method can also synthesize depth maps at arbitrarily rotated viewpoints by projecting the volumetric feature map into the target viewpoints. Furthermore, assuming static camera extrinsics in the multi-camera system, we propose to estimate a single canonical camera motion from the volumetric feature map. Our method leverages 3D spatio-temporal context to learn metric-scale depth and the canonical camera motion in a self-supervised manner. It outperforms prior methods on the DDAD and nuScenes datasets, in particular producing more accurate metric-scale depth and more consistent depth between neighboring views.
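The core idea of the fusion step can be illustrated with a minimal sketch (not the authors' implementation): per-view image features are back-projected into a shared voxel grid using each camera's intrinsics and extrinsics, and features from views that observe the same voxel are averaged. The function name, tensor shapes, and averaging scheme below are assumptions for illustration only.

import torch
import torch.nn.functional as F

def fuse_to_voxels(feats, K, T, voxel_centers):
    # feats: (V, C, H, W) per-view image feature maps
    # K: (V, 3, 3) camera intrinsics; T: (V, 4, 4) world-to-camera extrinsics
    # voxel_centers: (N, 3) voxel centers in the shared (ego) frame
    # returns: (C, N) fused volumetric features, averaged over views that see each voxel
    V, C, H, W = feats.shape
    N = voxel_centers.shape[0]
    homo = torch.cat([voxel_centers, torch.ones(N, 1)], dim=1)  # (N, 4) homogeneous coords
    fused = torch.zeros(C, N)
    weight = torch.zeros(1, N)
    for v in range(V):
        cam = (T[v] @ homo.T)[:3]                    # (3, N) voxel centers in camera frame
        z = cam[2].clamp(min=1e-3)                   # avoid division by zero behind the camera
        pix = (K[v] @ cam) / z                       # (3, N) pixel coordinates
        # normalize pixel coordinates to [-1, 1] for grid_sample
        u = pix[0] / (W - 1) * 2 - 1
        w = pix[1] / (H - 1) * 2 - 1
        grid = torch.stack([u, w], dim=-1).view(1, N, 1, 2)
        sampled = F.grid_sample(feats[v:v + 1], grid, align_corners=True)  # (1, C, N, 1)
        # keep only voxels that project in front of the camera and inside the image
        valid = ((cam[2] > 0) & (u.abs() <= 1) & (w.abs() <= 1)).float()
        fused += sampled.view(C, N) * valid
        weight += valid
    return fused / weight.clamp(min=1)

Projecting the fused volume back into an image plane (the original views for depth estimation, or a rotated virtual view for depth synthesis) follows the same geometry in reverse, with the camera pose of the target view in place of T[v].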

Cite

Text

Kim et al. "Self-Supervised Surround-View Depth Estimation with Volumetric Feature Fusion." Neural Information Processing Systems, 2022.

Markdown

[Kim et al. "Self-Supervised Surround-View Depth Estimation with Volumetric Feature Fusion." Neural Information Processing Systems, 2022.](https://mlanthology.org/neurips/2022/kim2022neurips-selfsupervised/)

BibTeX

@inproceedings{kim2022neurips-selfsupervised,
  title     = {{Self-Supervised Surround-View Depth Estimation with Volumetric Feature Fusion}},
  author    = {Kim, Jung-Hee and Hur, Junhwa and Nguyen, Tien Phuoc and Jeong, Seong-Gyun},
  booktitle = {Neural Information Processing Systems},
  year      = {2022},
  url       = {https://mlanthology.org/neurips/2022/kim2022neurips-selfsupervised/}
}