Learning Representations from Foundation Models for Domain Generalized Stereo Matching

Abstract

State-of-the-art stereo matching networks trained on in-domain data often underperform on cross-domain scenes. Intuitively, leveraging the zero-shot capacity of a foundation model can alleviate the cross-domain generalization problem. The main challenge of incorporating a foundation model into the stereo matching pipeline lies in the absence of an effective forward process from single-view coarse-grained tokens to cross-view fine-grained cost representations. In this paper, we propose FormerStereo, a general framework that integrates Vision Transformer (ViT) based foundation models into the stereo matching pipeline. Using this framework, we transfer all-purpose features into matching-specific ones. Specifically, we propose a reconstruction-constrained decoder to retrieve fine-grained representations from coarse-grained ViT tokens. To maintain cross-view consistent representations, we propose a cosine-constrained concatenation cost (C4) space to construct cost volumes. We integrate FormerStereo with state-of-the-art (SOTA) stereo matching networks and evaluate its effectiveness on multiple benchmark datasets. Experiments show that the FormerStereo framework effectively improves the zero-shot performance of existing stereo matching networks on unseen domains and achieves SOTA performance.
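The cosine-constrained concatenation cost space described above can be illustrated with a minimal NumPy sketch. This is an assumption-laden reconstruction, not the authors' code: it L2-normalizes left/right feature maps along the channel axis (the cosine constraint, so concatenated pairs are scale-invariant) and then builds a standard concatenation-style cost volume over candidate disparities. The function name, tensor layout `(C, H, W)`, and the `max_disp` parameter are all hypothetical choices for illustration.

```python
import numpy as np

def c4_cost_volume(feat_l, feat_r, max_disp):
    """Hedged sketch of a cosine-constrained concatenation cost volume.

    feat_l, feat_r: (C, H, W) feature maps from the left/right views.
    Returns a (2C, max_disp, H, W) volume; positions shifted out of view
    for a given disparity are left as zeros.
    """
    # Cosine constraint: unit-normalize each pixel's feature vector
    feat_l = feat_l / (np.linalg.norm(feat_l, axis=0, keepdims=True) + 1e-8)
    feat_r = feat_r / (np.linalg.norm(feat_r, axis=0, keepdims=True) + 1e-8)

    c, h, w = feat_l.shape
    cost = np.zeros((2 * c, max_disp, h, w), dtype=feat_l.dtype)
    for d in range(max_disp):
        # Pair each left pixel at column x with the right pixel at x - d
        cost[:c, d, :, d:] = feat_l[:, :, d:]
        cost[c:, d, :, d:] = feat_r[:, :, : w - d]
    return cost
```

Because both halves of the volume are unit-normalized, a downstream aggregation network sees features whose inner products are cosine similarities, which is one plausible way to keep cross-view representations consistent across domains.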

Cite

Text

Zhang et al. "Learning Representations from Foundation Models for Domain Generalized Stereo Matching." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-72946-1_9

Markdown

[Zhang et al. "Learning Representations from Foundation Models for Domain Generalized Stereo Matching." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/zhang2024eccv-learning-b/) doi:10.1007/978-3-031-72946-1_9

BibTeX

@inproceedings{zhang2024eccv-learning-b,
  title     = {{Learning Representations from Foundation Models for Domain Generalized Stereo Matching}},
  author    = {Zhang, Yongjian and Wang, Longguang and Li, Kunhong and Wang, Yun and Guo, Yulan},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-72946-1_9},
  url       = {https://mlanthology.org/eccv/2024/zhang2024eccv-learning-b/}
}