Per-Pixel Classification Is Not All You Need for Semantic Segmentation

Abstract

Modern approaches typically formulate semantic segmentation as a per-pixel classification task, while instance-level segmentation is handled with an alternative mask classification. Our key insight: mask classification is sufficiently general to solve both semantic- and instance-level segmentation tasks in a unified manner using the exact same model, loss, and training procedure. Following this observation, we propose MaskFormer, a simple mask classification model which predicts a set of binary masks, each associated with a single global class label prediction. Overall, the proposed mask classification-based method simplifies the landscape of effective approaches to semantic and panoptic segmentation tasks and shows excellent empirical results. In particular, we observe that MaskFormer outperforms per-pixel classification baselines when the number of classes is large. Our mask classification-based method outperforms both current state-of-the-art semantic (55.6 mIoU on ADE20K) and panoptic segmentation (52.7 PQ on COCO) models.

Cite

Text

Cheng et al. "Per-Pixel Classification Is Not All You Need for Semantic Segmentation." Neural Information Processing Systems, 2021.

Markdown

[Cheng et al. "Per-Pixel Classification Is Not All You Need for Semantic Segmentation." Neural Information Processing Systems, 2021.](https://mlanthology.org/neurips/2021/cheng2021neurips-perpixel/)

BibTeX

@inproceedings{cheng2021neurips-perpixel,
  title     = {{Per-Pixel Classification Is Not All You Need for Semantic Segmentation}},
  author    = {Cheng, Bowen and Schwing, Alex and Kirillov, Alexander},
  booktitle = {Neural Information Processing Systems},
  year      = {2021},
  url       = {https://mlanthology.org/neurips/2021/cheng2021neurips-perpixel/}
}