Exploring Simple Open-Vocabulary Semantic Segmentation

Abstract

Open-vocabulary semantic segmentation models aim to accurately assign a semantic label to each pixel in an image from a set of arbitrary open-vocabulary texts. To learn such pixel-level alignment, current approaches typically rely on a combination of (i) an image-level vision-language (VL) model (e.g., CLIP), (ii) ground-truth masks, (iii) custom grouping encoders, and (iv) the Segment Anything Model (SAM). In this paper, we introduce S-Seg, a simple model that achieves surprisingly strong performance without depending on any of these elements. S-Seg leverages pseudo-masks and language to train a MaskFormer, and can be easily trained on publicly available image-text datasets. In contrast to prior works, our model directly learns the alignment between pixel-level features and language. Once trained, S-Seg generalizes well to multiple testing datasets without requiring fine-tuning. In addition, S-Seg scales with data and improves consistently when augmented with self-training. We believe that our simple yet effective approach will serve as a solid baseline for future research. Our code and demo will be made publicly available soon.
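
The core mechanism the abstract describes, aligning per-pixel features with text embeddings so that arbitrary class names can label pixels, can be illustrated with a short sketch. The snippet below is a minimal, hypothetical illustration and not the authors' implementation: it assumes the segmentation model yields D-dimensional per-pixel features and a text encoder yields D-dimensional embeddings for the query class names, then labels each pixel by its most similar prompt under cosine similarity.

import torch
import torch.nn.functional as F

def open_vocab_segment(pixel_feats: torch.Tensor,
                       text_embeds: torch.Tensor) -> torch.Tensor:
    """Label each pixel with its most similar text embedding.

    pixel_feats: (H, W, D) per-pixel features from a segmentation model.
    text_embeds: (C, D) embeddings of C arbitrary class-name prompts.
    Returns an (H, W) map of class indices.
    """
    # L2-normalize so the dot product equals cosine similarity.
    pixel_feats = F.normalize(pixel_feats, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    # (H, W, C) similarity between every pixel and every class prompt.
    logits = torch.einsum("hwd,cd->hwc", pixel_feats, text_embeds)
    return logits.argmax(dim=-1)

if __name__ == "__main__":
    # Toy stand-ins for real model outputs; shapes are hypothetical.
    feats = torch.randn(64, 64, 512)   # per-pixel features
    prompts = torch.randn(3, 512)      # e.g. "cat", "grass", "sky"
    labels = open_vocab_segment(feats, prompts)
    print(labels.shape)                # torch.Size([64, 64])

Because the class set enters only through text embeddings, the same trained model can be queried with any vocabulary at test time, which is what allows generalization across datasets without fine-tuning.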

Cite

Text

Lai. "Exploring Simple Open-Vocabulary Semantic Segmentation." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.02813

Markdown

[Lai. "Exploring Simple Open-Vocabulary Semantic Segmentation." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/lai2025cvpr-exploring/) doi:10.1109/CVPR52734.2025.02813

BibTeX

@inproceedings{lai2025cvpr-exploring,
  title     = {{Exploring Simple Open-Vocabulary Semantic Segmentation}},
  author    = {Lai, Zihang},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {30221--30230},
  doi       = {10.1109/CVPR52734.2025.02813},
  url       = {https://mlanthology.org/cvpr/2025/lai2025cvpr-exploring/}
}