Stereo Depth Estimation with Echoes

Abstract

Stereo depth estimation works well in local textured regions, whereas echoes provide good depth estimates in global textureless regions; the two modalities therefore complement each other. Motivated by this reciprocal relationship, in this paper we propose an end-to-end framework named StereoEchoes for stereo depth estimation with echoes. A Cross-modal Volume Refinement module is designed to transfer the complementary knowledge of the audio modality to the visual modality at the feature level. A Relative Depth Uncertainty Estimation module is further proposed to yield pixel-wise confidence for multimodal depth fusion in the output space. As no dataset exists for this new problem, we introduce two Stereo-Echo datasets, Stereo-Replica and Stereo-Matterport3D, for the first time. Remarkably, we show empirically that on Stereo-Replica and Stereo-Matterport3D our StereoEchoes outperforms stereo depth estimation methods by 25%/13.8% in RMSE, and surpasses the state-of-the-art audio-visual depth prediction method by 25.3%/42.3% in RMSE.
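
The abstract's "pixel-wise confidence for multimodal depth fusion" amounts to a learned convex combination of the two depth maps. Below is a minimal sketch of such a fusion, not the authors' implementation: the module name, the tiny sigmoid confidence head, and all tensor shapes are illustrative assumptions, included only to make the fusion idea concrete.

```python
# Hypothetical sketch (not the paper's code): per-pixel confidence-weighted
# fusion of a stereo depth map and an echo-based depth map.
import torch
import torch.nn as nn


class ConfidenceFusion(nn.Module):
    """Fuse two depth estimates via a learned per-pixel confidence map."""

    def __init__(self, in_channels: int = 2):
        super().__init__()
        # Tiny head mapping the stacked depth maps to a confidence in [0, 1];
        # the architecture here is an assumption, not the paper's module.
        self.conf_head = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(16, 1, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, depth_stereo: torch.Tensor,
                depth_echo: torch.Tensor) -> torch.Tensor:
        # depth_*: (B, 1, H, W) depth maps from each modality.
        conf = self.conf_head(torch.cat([depth_stereo, depth_echo], dim=1))
        # Convex combination: high confidence trusts the stereo estimate
        # (textured regions); low confidence falls back to echoes
        # (textureless regions).
        return conf * depth_stereo + (1.0 - conf) * depth_echo


if __name__ == "__main__":
    fuse = ConfidenceFusion()
    d_stereo = torch.rand(1, 1, 128, 160)
    d_echo = torch.rand(1, 1, 128, 160)
    print(fuse(d_stereo, d_echo).shape)  # torch.Size([1, 1, 128, 160])
```

In practice such a head would condition on features rather than raw depths alone; the point of the sketch is only the output-space weighting, in which each pixel interpolates between the stereo and echo predictions.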

Cite

Text

Zhang et al. "Stereo Depth Estimation with Echoes." Proceedings of the European Conference on Computer Vision (ECCV), 2022. doi:10.1007/978-3-031-19812-0_29

Markdown

[Zhang et al. "Stereo Depth Estimation with Echoes." Proceedings of the European Conference on Computer Vision (ECCV), 2022.](https://mlanthology.org/eccv/2022/zhang2022eccv-stereo/) doi:10.1007/978-3-031-19812-0_29

BibTeX

@inproceedings{zhang2022eccv-stereo,
  title     = {{Stereo Depth Estimation with Echoes}},
  author    = {Zhang, Chenghao and Tian, Kun and Ni, Bolin and Meng, Gaofeng and Fan, Bin and Zhang, Zhaoxiang and Pan, Chunhong},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2022},
  doi       = {10.1007/978-3-031-19812-0_29},
  url       = {https://mlanthology.org/eccv/2022/zhang2022eccv-stereo/}
}