Transformer-Based Attention Networks for Continuous Pixel-Wise Prediction

Abstract

While convolutional neural networks have had a tremendous impact on various computer vision tasks, they generally demonstrate limitations in explicitly modeling long-range dependencies due to the intrinsic locality of the convolution operation. Initially designed for natural language processing tasks, transformers have emerged as alternative architectures with an innate global self-attention mechanism for capturing long-range dependencies. In this paper, we propose TransDepth, an architecture that benefits from both convolutional neural networks and transformers. To prevent the network from losing its ability to capture local-level details when transformers are adopted, we propose a novel decoder that employs gate-based attention mechanisms. Notably, this is the first paper to apply transformers to pixel-wise prediction problems involving continuous labels (i.e., monocular depth prediction and surface normal estimation). Extensive experiments demonstrate that the proposed TransDepth achieves state-of-the-art performance on three challenging datasets. Our code is available at: https://github.com/ygjwd12345/TransDepth.
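The abstract's central design point, a decoder whose skip connections are reweighted by gate-based attention, can be illustrated with a short sketch. The PyTorch snippet below is a minimal illustration in the spirit of standard attention gates (e.g., Attention U-Net), not the authors' exact TransDepth decoder; the module name AttentionGate and all channel sizes are illustrative assumptions. The idea it demonstrates: a coarse decoder signal produces a per-pixel gate that suppresses irrelevant regions in the fine-grained encoder features, which is one way to retain local detail alongside a transformer's global self-attention.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionGate(nn.Module):
    # Illustrative gate-based attention (assumed, in the style of
    # Attention U-Net); not the exact TransDepth decoder block.
    def __init__(self, enc_channels, gate_channels, inter_channels):
        super().__init__()
        self.theta = nn.Conv2d(enc_channels, inter_channels, kernel_size=1)
        self.phi = nn.Conv2d(gate_channels, inter_channels, kernel_size=1)
        self.psi = nn.Conv2d(inter_channels, 1, kernel_size=1)

    def forward(self, enc_feat, gate):
        # Project both inputs into a shared space and upsample the
        # coarse gating signal to the encoder feature resolution.
        g = F.interpolate(self.phi(gate), size=enc_feat.shape[2:],
                          mode="bilinear", align_corners=False)
        # Per-pixel gate in [0, 1] suppresses irrelevant regions while
        # preserving local detail in the skip connection.
        attn = torch.sigmoid(self.psi(F.relu(self.theta(enc_feat) + g)))
        return enc_feat * attn

# Usage with illustrative shapes: 64-channel encoder features gated by
# a 128-channel, lower-resolution decoder signal.
enc = torch.randn(2, 64, 56, 56)
dec = torch.randn(2, 128, 28, 28)
out = AttentionGate(64, 128, 32)(enc, dec)
print(out.shape)  # torch.Size([2, 64, 56, 56])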

Cite

Text

Yang et al. "Transformer-Based Attention Networks for Continuous Pixel-Wise Prediction." International Conference on Computer Vision, 2021. doi:10.1109/ICCV48922.2021.01596

Markdown

[Yang et al. "Transformer-Based Attention Networks for Continuous Pixel-Wise Prediction." International Conference on Computer Vision, 2021.](https://mlanthology.org/iccv/2021/yang2021iccv-transformerbased/) doi:10.1109/ICCV48922.2021.01596

BibTeX

@inproceedings{yang2021iccv-transformerbased,
  title     = {{Transformer-Based Attention Networks for Continuous Pixel-Wise Prediction}},
  author    = {Yang, Guanglei and Tang, Hao and Ding, Mingli and Sebe, Nicu and Ricci, Elisa},
  booktitle = {International Conference on Computer Vision},
  year      = {2021},
  pages     = {16269--16279},
  doi       = {10.1109/ICCV48922.2021.01596},
  url       = {https://mlanthology.org/iccv/2021/yang2021iccv-transformerbased/}
}