Co-Scale Conv-Attentional Image Transformers

Abstract

In this paper, we present Co-scale conv-attentional image Transformers (CoaT), a Transformer-based image classifier equipped with co-scale and conv-attentional mechanisms. First, the co-scale mechanism maintains the integrity of Transformers' encoder branches at individual scales, while allowing representations learned at different scales to effectively communicate with each other; we design a series of serial and parallel blocks to realize the co-scale mechanism. Second, we devise a conv-attentional mechanism by realizing a relative position embedding formulation in the factorized attention module with an efficient convolution-like implementation. CoaT empowers image Transformers with enriched multi-scale and contextual modeling capabilities. On ImageNet, relatively small CoaT models attain superior classification results compared with similar-sized convolutional neural networks and image/vision Transformers. The effectiveness of CoaT's backbone is also illustrated on object detection and instance segmentation, demonstrating its applicability to downstream computer vision tasks.
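
The conv-attentional module described above can be sketched in a few lines of PyTorch. This is a minimal illustration under stated assumptions, not the authors' released implementation: the class name FactorizedConvAttention is hypothetical, a single 3x3 depthwise convolution stands in for the convolutional relative position encoding, and the class token is omitted (the full model handles it and varies the convolution window across head groups). It shows the two ideas the abstract names: factorized attention, which applies softmax over the keys and computes K^T V before multiplying by Q, and a convolution-like relative position term gated elementwise by Q.

import torch
import torch.nn as nn


class FactorizedConvAttention(nn.Module):
    """Sketch of factorized attention with a conv-based relative position term."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Depthwise conv standing in for the convolutional relative
        # position encoding (assumption: one 3x3 window for all heads).
        self.crpe = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x, height, width):
        # x: (batch, N, dim) with N == height * width; class token omitted.
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (B, heads, N, head_dim)

        # Factorized attention: softmax over the token axis of K, then
        # K^T V before Q, so cost is linear rather than quadratic in N.
        k_softmax = k.softmax(dim=2)
        context = k_softmax.transpose(-2, -1) @ v  # (B, heads, head_dim, head_dim)
        factor_att = q @ context                   # (B, heads, N, head_dim)

        # Conv-based relative position term: depthwise conv over V in its
        # 2D layout, then gated elementwise by Q.
        v_img = v.transpose(1, 2).reshape(B, N, C)
        v_img = v_img.transpose(1, 2).reshape(B, C, height, width)
        crpe = self.crpe(v_img).reshape(B, C, N).transpose(1, 2)
        crpe = crpe.reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        crpe = q * crpe

        out = self.scale * factor_att + crpe
        out = out.transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


# Usage: a 14x14 token grid with 64-dimensional embeddings.
tokens = torch.randn(2, 14 * 14, 64)
attn = FactorizedConvAttention(dim=64, num_heads=8)
print(attn(tokens, 14, 14).shape)  # torch.Size([2, 196, 64])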

Cite

Text

Xu et al. "Co-Scale Conv-Attentional Image Transformers." International Conference on Computer Vision, 2021. doi:10.1109/ICCV48922.2021.00983

Markdown

[Xu et al. "Co-Scale Conv-Attentional Image Transformers." International Conference on Computer Vision, 2021.](https://mlanthology.org/iccv/2021/xu2021iccv-coscale/) doi:10.1109/ICCV48922.2021.00983

BibTeX

@inproceedings{xu2021iccv-coscale,
  title     = {{Co-Scale Conv-Attentional Image Transformers}},
  author    = {Xu, Weijian and Xu, Yifan and Chang, Tyler and Tu, Zhuowen},
  booktitle = {International Conference on Computer Vision},
  year      = {2021},
  pages     = {9981--9990},
  doi       = {10.1109/ICCV48922.2021.00983},
  url       = {https://mlanthology.org/iccv/2021/xu2021iccv-coscale/}
}