A Simple Framework for Text-Supervised Semantic Segmentation

Abstract

Text-supervised semantic segmentation is a novel research topic that allows semantic segments to emerge through image-text contrasting. However, pioneering methods can depend on specially designed network architectures. This paper shows that a vanilla contrastive language-image pre-training (CLIP) model is an effective text-supervised semantic segmenter by itself. First, we reveal that a vanilla CLIP is poor at localization and segmentation because its optimization is driven by densely aligning visual and language representations. Second, we propose locality-driven alignment (LoDA) to address this problem: CLIP optimization is instead driven by sparsely aligning local representations. Third, we propose a simple segmentation (SimSeg) framework. Together, LoDA and SimSeg enable a vanilla CLIP to produce impressive semantic segmentation results. Our method outperforms previous state-of-the-art methods on the PASCAL VOC 2012, PASCAL Context, and COCO datasets by large margins. Code and models are available at github.com/muyangyi/SimSeg.
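To make the dense-versus-sparse distinction concrete, the sketch below contrasts a vanilla CLIP-style objective (contrast a globally pooled image embedding against text embeddings) with a locality-driven alternative that aligns each text with only its single most responsive local feature. This is a minimal PyTorch illustration of the idea as stated in the abstract; the function names, tensor shapes, and the max-selection rule are assumptions for illustration, not the paper's actual LoDA implementation.

import torch
import torch.nn.functional as F

def dense_alignment_logits(patch_feats, text_feats, temperature=0.07):
    # Vanilla CLIP-style alignment: average-pool all local (patch) features
    # into one global image embedding, then contrast it with text embeddings.
    # patch_feats: (B, N, D); text_feats: (B, D); returns (B, B) logits.
    img = F.normalize(patch_feats.mean(dim=1), dim=-1)
    txt = F.normalize(text_feats, dim=-1)
    return img @ txt.t() / temperature

def sparse_local_alignment_logits(patch_feats, text_feats, temperature=0.07):
    # Locality-driven alternative (illustrative): for each (image, text) pair
    # keep only the single best-matching local feature, so the gradient
    # concentrates on discriminative regions rather than being spread
    # densely across all patches.
    loc = F.normalize(patch_feats, dim=-1)        # (B, N, D)
    txt = F.normalize(text_feats, dim=-1)         # (B, D)
    sim = torch.einsum('bnd,kd->bnk', loc, txt)   # (B, N, B) patch-to-text
    return sim.max(dim=1).values / temperature    # (B, B) logits

Either logits matrix can be trained with the standard symmetric InfoNCE objective, e.g. F.cross_entropy(logits, torch.arange(B)) averaged with the same loss on the transposed logits.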

Cite

Text

Yi et al. "A Simple Framework for Text-Supervised Semantic Segmentation." Conference on Computer Vision and Pattern Recognition, 2023. doi:10.1109/CVPR52729.2023.00683

Markdown

[Yi et al. "A Simple Framework for Text-Supervised Semantic Segmentation." Conference on Computer Vision and Pattern Recognition, 2023.](https://mlanthology.org/cvpr/2023/yi2023cvpr-simple/) doi:10.1109/CVPR52729.2023.00683

BibTeX

@inproceedings{yi2023cvpr-simple,
  title     = {{A Simple Framework for Text-Supervised Semantic Segmentation}},
  author    = {Yi, Muyang and Cui, Quan and Wu, Hao and Yang, Cheng and Yoshie, Osamu and Lu, Hongtao},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2023},
  pages     = {7071--7080},
  doi       = {10.1109/CVPR52729.2023.00683},
  url       = {https://mlanthology.org/cvpr/2023/yi2023cvpr-simple/}
}