Context-Enhanced Stereo Transformer

Abstract

Stereo depth estimation is of great interest for computer vision research. However, existing methods struggle to generalize and to predict reliably in hazardous regions, such as large uniform areas. To overcome these limitations, we propose the Context Enhanced Path (CEP). CEP improves generalization and robustness against common failure cases of existing solutions by capturing long-range global information. We construct our stereo depth estimation model, Context Enhanced Stereo Transformer (CEST), by plugging CEP into the state-of-the-art stereo depth estimation method Stereo Transformer. We evaluate CEST on distinct public datasets, including Scene Flow, Middlebury-2014, KITTI-2015, and MPI Sintel, and find that it outperforms prior approaches by a large margin. For example, in the zero-shot synthetic-to-real setting, CEST outperforms the best competing approach on the Middlebury-2014 dataset by 11%. Our extensive experiments demonstrate that long-range information is critical for the stereo matching task and that CEP successfully captures such information.
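
For illustration only, the sketch below shows one plausible reading of the abstract's design: a global self-attention branch ("context path") added alongside a local convolutional feature extractor, so that matching features carry long-range context. The module names, shapes, and fusion scheme are assumptions for this sketch, not the authors' implementation.

# Minimal PyTorch sketch, assuming CEP resembles a global-attention
# context branch fused into local stereo features. Illustrative only;
# not the published CEST architecture.
import torch
import torch.nn as nn

class ContextEnhancedPath(nn.Module):
    """Hypothetical global-context branch: self-attention over all pixels."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) -> flatten spatial dims into a token sequence
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)       # (B, H*W, C)
        ctx, _ = self.attn(tokens, tokens, tokens)     # global attention
        tokens = self.norm(tokens + ctx)               # residual + norm
        return tokens.transpose(1, 2).reshape(b, c, h, w)

class StereoFeatureExtractor(nn.Module):
    """Local conv features, optionally enriched by the global-context path."""

    def __init__(self, channels: int = 32, use_cep: bool = True):
        super().__init__()
        self.local = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
        )
        self.cep = ContextEnhancedPath(channels) if use_cep else None

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        feat = self.local(img)
        if self.cep is not None:
            # Fuse long-range context into the local matching features.
            feat = feat + self.cep(feat)
        return feat

if __name__ == "__main__":
    left = torch.randn(1, 3, 32, 64)
    feats = StereoFeatureExtractor()(left)
    print(feats.shape)  # torch.Size([1, 32, 32, 64])

Under this reading, large uniform areas benefit because the attention branch lets every pixel aggregate evidence from distant, textured parts of the image before matching.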

Cite

Text

Guo et al. "Context-Enhanced Stereo Transformer." Proceedings of the European Conference on Computer Vision (ECCV), 2022. doi:10.1007/978-3-031-19824-3_16

Markdown

[Guo et al. "Context-Enhanced Stereo Transformer." Proceedings of the European Conference on Computer Vision (ECCV), 2022.](https://mlanthology.org/eccv/2022/guo2022eccv-contextenhanced/) doi:10.1007/978-3-031-19824-3_16

BibTeX

@inproceedings{guo2022eccv-contextenhanced,
  title     = {{Context-Enhanced Stereo Transformer}},
  author    = {Guo, Weiyu and Li, Zhaoshuo and Yang, Yongkui and Wang, Zheng and Taylor, Russell H. and Unberath, Mathias and Yuille, Alan and Li, Yingwei},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2022},
  doi       = {10.1007/978-3-031-19824-3_16},
  url       = {https://mlanthology.org/eccv/2022/guo2022eccv-contextenhanced/}
}