Learning to Generate Text-Grounded Mask for Open-World Semantic Segmentation from Only Image-Text Pairs
Abstract
We tackle open-world semantic segmentation, which aims to learn to segment arbitrary visual concepts in images using only image-text pairs, without dense annotations. Existing open-world segmentation methods have shown impressive advances by employing contrastive learning (CL) to learn diverse visual concepts and transferring the learned image-level understanding to the segmentation task. However, these CL-based methods suffer from a train-test discrepancy: they only consider image-text alignment during training, whereas segmentation requires region-text alignment at test time. In this paper, we propose a novel Text-grounded Contrastive Learning (TCL) framework that enables a model to learn region-text alignment directly. Our method generates a segmentation mask for a given text, extracts a text-grounded image embedding from the masked region, and aligns it with the text embedding via TCL. By learning region-text alignment directly, our framework encourages the model to improve the quality of the generated segmentation masks. In addition, for a rigorous and fair comparison, we present a unified evaluation protocol with 8 widely used semantic segmentation datasets. TCL achieves state-of-the-art zero-shot segmentation performance by large margins on all 8 datasets. Code is available at https://github.com/kakaobrain/tcl.
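The core loop the abstract describes, predicting a mask for the paired text, pooling an image embedding from the masked region, and contrasting it with the text embedding, can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: the function name `tcl_loss`, the tensor shapes, and the soft mask-weighted pooling are hypothetical; see the linked repository for the actual TCL objective.

```python
# Hypothetical sketch of text-grounded contrastive learning (not the
# official TCL code): predict a mask per image-text pair, pool a
# text-grounded image embedding from the masked region, and align it
# with the text embedding via a symmetric InfoNCE loss.
import torch
import torch.nn.functional as F

def tcl_loss(image_feats, text_embeds, mask_logits, temperature=0.07):
    """
    image_feats: (B, D, H, W) dense features from an image encoder.
    text_embeds: (B, D) embeddings of the paired texts.
    mask_logits: (B, 1, H, W) predicted mask logits, one per pair.
    """
    # Soft mask in [0, 1] for the region grounded by the text.
    masks = torch.sigmoid(mask_logits)                       # (B, 1, H, W)

    # Text-grounded image embedding: mask-weighted average pooling.
    weighted = (image_feats * masks).sum(dim=(2, 3))         # (B, D)
    grounded = weighted / masks.sum(dim=(2, 3)).clamp(min=1e-6)

    # Cosine-similarity logits between grounded regions and texts.
    grounded = F.normalize(grounded, dim=-1)
    texts = F.normalize(text_embeds, dim=-1)
    logits = grounded @ texts.t() / temperature              # (B, B)

    # Symmetric contrastive loss: the matched pair in the batch is
    # the positive; all other pairs serve as negatives.
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

Because the loss is computed from the masked region rather than the whole image, the gradient pushes the mask generator itself toward masks that tightly cover the text's referent, which is how region-text alignment is learned without dense labels.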
Cite
Text
Cha et al. "Learning to Generate Text-Grounded Mask for Open-World Semantic Segmentation from Only Image-Text Pairs." Conference on Computer Vision and Pattern Recognition, 2023. doi:10.1109/CVPR52729.2023.01074

Markdown

[Cha et al. "Learning to Generate Text-Grounded Mask for Open-World Semantic Segmentation from Only Image-Text Pairs." Conference on Computer Vision and Pattern Recognition, 2023.](https://mlanthology.org/cvpr/2023/cha2023cvpr-learning/) doi:10.1109/CVPR52729.2023.01074

BibTeX
@inproceedings{cha2023cvpr-learning,
title = {{Learning to Generate Text-Grounded Mask for Open-World Semantic Segmentation from Only Image-Text Pairs}},
author = {Cha, Junbum and Mun, Jonghwan and Roh, Byungseok},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2023},
pages = {11165--11174},
doi = {10.1109/CVPR52729.2023.01074},
url = {https://mlanthology.org/cvpr/2023/cha2023cvpr-learning/}
}