Language-Driven Semantic Segmentation
Abstract
We present LSeg, a novel model for language-driven semantic image segmentation. LSeg uses a text encoder to compute embeddings of descriptive input labels (e.g., "grass" or "building") together with a transformer-based image encoder that computes dense per-pixel embeddings of the input image. The image encoder is trained with a contrastive objective to align pixel embeddings to the text embedding of the corresponding semantic class. The text embeddings provide a flexible label representation in which semantically similar labels map to similar regions in the embedding space (e.g., "cat" and "furry"). This allows LSeg to generalize to previously unseen categories at test time, without retraining or even requiring a single additional training sample. We demonstrate that our approach achieves highly competitive zero-shot performance compared to existing zero- and few-shot semantic segmentation methods, and even matches the accuracy of traditional segmentation algorithms when a fixed label set is provided. Code and demo are available at https://github.com/isl-org/lang-seg.
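The abstract describes classifying each pixel by aligning its embedding with the text embeddings of the candidate labels. A minimal sketch of that scoring step (not the official implementation; shapes, names, and the use of random embeddings are illustrative assumptions) might look like:

```python
import numpy as np

def segment(pixel_emb: np.ndarray, label_emb: np.ndarray) -> np.ndarray:
    """Assign each pixel the label whose text embedding it is most similar to.

    pixel_emb: (H, W, C) dense per-pixel embeddings from the image encoder.
    label_emb: (K, C) one text-encoder embedding per candidate label.
    Returns an (H, W) map of label indices.
    """
    # L2-normalize so the dot product equals cosine similarity.
    p = pixel_emb / np.linalg.norm(pixel_emb, axis=-1, keepdims=True)
    t = label_emb / np.linalg.norm(label_emb, axis=-1, keepdims=True)
    scores = np.einsum("hwc,kc->hwk", p, t)  # per-pixel similarity to each label
    return scores.argmax(axis=-1)

# Toy usage with random stand-in embeddings and a hypothetical label set;
# in LSeg the label set can change at test time without retraining.
rng = np.random.default_rng(0)
labels = ["grass", "building", "other"]
pixels = rng.normal(size=(4, 4, 8))
texts = rng.normal(size=(len(labels), 8))
mask = segment(pixels, texts)
```

Because classification reduces to nearest text embedding, swapping in unseen labels only requires re-encoding the label strings.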
Cite
Text
Li et al. "Language-Driven Semantic Segmentation." International Conference on Learning Representations, 2022.

Markdown
[Li et al. "Language-Driven Semantic Segmentation." International Conference on Learning Representations, 2022.](https://mlanthology.org/iclr/2022/li2022iclr-languagedriven/)

BibTeX
@inproceedings{li2022iclr-languagedriven,
title = {{Language-Driven Semantic Segmentation}},
author = {Li, Boyi and Weinberger, Kilian Q. and Belongie, Serge and Koltun, Vladlen and Ranftl, Ren{\'e}},
booktitle = {International Conference on Learning Representations},
year = {2022},
url = {https://mlanthology.org/iclr/2022/li2022iclr-languagedriven/}
}