A Transformer-Based Decoder for Semantic Segmentation with Multi-Level Context Mining

Abstract

Transformers have recently shown superior performance to CNNs on semantic segmentation. However, previous works mostly focus on the deliberate design of the encoder and seldom consider the decoder. In this paper, we find that a lightweight decoder matters for segmentation, and propose a pure transformer-based segmentation decoder, named SegDeformer, that seamlessly integrates with a variety of current transformer-based encoders. The highlight is that SegDeformer conveniently exploits the tokenized input and the attention mechanism of the transformer for effective context mining. This is achieved by two key component designs, i.e., the internal and external context mining modules. The former applies internal attention within an image to better capture global-local context, while the latter introduces external tokens from other images to enhance the current representation. To make SegDeformer scalable, we further provide performance/efficiency optimization modules for flexible deployment. Experiments on the widely used benchmarks ADE20K, COCO-Stuff and Cityscapes, and with different transformer encoders (e.g., ViT, MiT and Swin), demonstrate that SegDeformer brings consistent performance gains.
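The two context-mining modules described in the abstract both rest on standard scaled dot-product attention over token sequences. The following is a minimal NumPy sketch of that idea, not the authors' implementation: the internal path attends among a single image's tokens, while the external path lets those tokens additionally attend to a hypothetical bank of tokens drawn from other images. All shapes and names here are illustrative assumptions.

```python
import numpy as np

def attention(q, k, v):
    # Scaled dot-product attention: softmax(QK^T / sqrt(d)) V
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
d = 8
internal = rng.standard_normal((16, d))   # tokens from the current image (assumed)
external = rng.standard_normal((4, d))    # hypothetical token bank from other images

# Internal context mining: tokens attend within the current image.
out_internal = attention(internal, internal, internal)

# External context mining: current tokens also attend to the external bank.
ctx = np.concatenate([internal, external], axis=0)
out_external = attention(internal, ctx, ctx)

print(out_internal.shape, out_external.shape)  # (16, 8) (16, 8)
```

In the paper's decoder these attention outputs would feed further transformer layers and a segmentation head; the sketch only shows how tokenized inputs make both within-image and cross-image context mining a matter of choosing the key/value set.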

Cite

Text

Shi et al. "A Transformer-Based Decoder for Semantic Segmentation with Multi-Level Context Mining." Proceedings of the European Conference on Computer Vision (ECCV), 2022. doi:10.1007/978-3-031-19815-1_36

Markdown

[Shi et al. "A Transformer-Based Decoder for Semantic Segmentation with Multi-Level Context Mining." Proceedings of the European Conference on Computer Vision (ECCV), 2022.](https://mlanthology.org/eccv/2022/shi2022eccv-transformerbased/) doi:10.1007/978-3-031-19815-1_36

BibTeX

@inproceedings{shi2022eccv-transformerbased,
  title     = {{A Transformer-Based Decoder for Semantic Segmentation with Multi-Level Context Mining}},
  author    = {Shi, Bowen and Jiang, Dongsheng and Zhang, Xiaopeng and Li, Han and Dai, Wenrui and Zou, Junni and Xiong, Hongkai and Tian, Qi},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2022},
  doi       = {10.1007/978-3-031-19815-1_36},
  url       = {https://mlanthology.org/eccv/2022/shi2022eccv-transformerbased/}
}