WT-MVSNet: Window-Based Transformers for Multi-View Stereo

Abstract

Recently, Transformers have been shown to enhance the performance of multi-view stereo by enabling long-range feature interaction. In this work, we propose Window-based Transformers (WT) for local feature matching and global feature aggregation in multi-view stereo. We introduce a Window-based Epipolar Transformer (WET), which reduces matching redundancy by using epipolar constraints. Since point-to-line matching is sensitive to erroneous camera poses and calibration, we match windows near the epipolar lines. A second, Shifted WT is employed for aggregating global information within the cost volume. We present a novel Cost Transformer (CT) to replace 3D convolutions for cost volume regularization. To better constrain the estimated depth maps from multiple views, we further design a novel geometric consistency loss (Geo Loss), which penalizes unreliable areas where multi-view consistency is not satisfied. Our WT multi-view stereo method (WT-MVSNet) achieves state-of-the-art performance across multiple datasets and ranks $1^{st}$ on the Tanks and Temples benchmark. Code will be available upon acceptance.
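The Geo Loss penalizes pixels whose depths are not geometrically consistent across views. As a rough illustration of the underlying cross-view check (not the paper's exact loss; the function name, thresholds, and nearest-neighbour sampling below are our own simplifications), a standard forward-backward reprojection test between a reference and a source depth map can be sketched as:

```python
import numpy as np

def geometric_consistency(depth_ref, depth_src, K_ref, K_src, T_ref2src,
                          pix_thresh=1.0, depth_thresh=0.01):
    """Mask of reference pixels whose depths survive forward-backward
    reprojection into the source view. Thresholds are illustrative."""
    H, W = depth_ref.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], 0).reshape(3, -1).astype(np.float64)

    # Back-project reference pixels to 3D points in the reference frame.
    X_ref = np.linalg.inv(K_ref) @ pix * depth_ref.reshape(1, -1)

    # Transform into the source frame and project onto the source image.
    X_src = T_ref2src[:3, :3] @ X_ref + T_ref2src[:3, 3:4]
    p_src = K_src @ X_src
    u_src, v_src = p_src[0] / p_src[2], p_src[1] / p_src[2]

    # Sample the source depth map (nearest neighbour, clamped to bounds).
    ui = np.clip(np.round(u_src).astype(int), 0, W - 1)
    vi = np.clip(np.round(v_src).astype(int), 0, H - 1)
    d_src = depth_src[vi, ui]

    # Back-project the sampled source depth and map back to the reference.
    pix_src = np.stack([u_src, v_src, np.ones_like(u_src)], 0)
    X_back = np.linalg.inv(K_src) @ pix_src * d_src.reshape(1, -1)
    T_src2ref = np.linalg.inv(T_ref2src)
    X_ref_back = T_src2ref[:3, :3] @ X_back + T_src2ref[:3, 3:4]
    p_ref = K_ref @ X_ref_back
    u_back, v_back = p_ref[0] / p_ref[2], p_ref[1] / p_ref[2]
    d_back = X_ref_back[2]

    # A pixel is consistent if both reprojection and relative depth
    # errors stay below the thresholds.
    pix_err = np.hypot(u_back - pix[0], v_back - pix[1])
    depth_err = np.abs(d_back - depth_ref.reshape(-1)) / depth_ref.reshape(-1)
    return ((pix_err < pix_thresh) & (depth_err < depth_thresh)).reshape(H, W)
```

In a loss, the complement of this mask would mark the unreliable regions to be penalized; the paper's formulation operates on the network's estimated depth maps rather than this hard-thresholded variant.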

Cite

Text

Liao et al. "WT-MVSNet: Window-Based Transformers for Multi-View Stereo." Neural Information Processing Systems, 2022.

Markdown

[Liao et al. "WT-MVSNet: Window-Based Transformers for Multi-View Stereo." Neural Information Processing Systems, 2022.](https://mlanthology.org/neurips/2022/liao2022neurips-wtmvsnet/)

BibTeX

@inproceedings{liao2022neurips-wtmvsnet,
  title     = {{WT-MVSNet: Window-Based Transformers for Multi-View Stereo}},
  author    = {Liao, Jinli and Ding, Yikang and Shavit, Yoli and Huang, Dihe and Ren, Shihao and Guo, Jia and Feng, Wensen and Zhang, Kai},
  booktitle = {Neural Information Processing Systems},
  year      = {2022},
  url       = {https://mlanthology.org/neurips/2022/liao2022neurips-wtmvsnet/}
}