CLIPer: Hierarchically Improving Spatial Representation of CLIP for Open-Vocabulary Semantic Segmentation

Abstract

Contrastive Language-Image Pre-training (CLIP) exhibits strong zero-shot classification ability on image-level tasks, which has motivated research on adapting CLIP for open-vocabulary semantic segmentation without training. The key is to improve the spatial representation of image-level CLIP, for example by replacing the self-attention map at the last layer with a self-self attention map or an attention map derived from a vision foundation model. In this paper, we present a novel hierarchical framework, named CLIPer, that hierarchically improves the spatial representation of CLIP. CLIPer comprises an early-layer fusion module and a fine-grained compensation module. We observe that the embeddings and attention maps at early layers preserve spatial structural information. Inspired by this, we design the early-layer fusion module to generate segmentation maps with better spatial coherence. Afterwards, we employ the fine-grained compensation module to compensate for local details using the self-attention maps of a diffusion model. We conduct experiments on eight segmentation datasets, where CLIPer achieves state-of-the-art performance. With ViT-L and sliding-window inference, CLIPer attains mIoU scores of 72.2% on VOC and 44.7% on Object, outperforming ProxyCLIP by 11.6% and 5.5%, respectively. Our code is available at https://github.com/linsun449/cliper.code.
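The early-layer fusion idea can be sketched in a few lines of NumPy. The snippet below is an illustrative simplification, not the paper's exact formulation: it assumes per-layer query/key projections are already extracted from a ViT, fuses the first few layers' attention maps by uniform averaging (the `num_early` cutoff and the averaging scheme are assumptions), and uses the fused map to aggregate last-layer value embeddings into spatially coherent patch features.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def early_layer_fusion(queries, keys, values, num_early=6):
    """Hypothetical sketch of early-layer attention fusion.

    queries, keys: lists of (N, D) arrays, one per transformer layer,
                   where N is the number of patch tokens.
    values:        (N, D) array of last-layer value embeddings.
    Returns the fused (N, N) attention map and the refined (N, D)
    patch embeddings obtained by applying it to `values`.
    """
    d = queries[0].shape[-1]
    scale = d ** -0.5
    # One softmax attention map per early layer.
    maps = [softmax((q @ k.T) * scale)
            for q, k in zip(queries[:num_early], keys[:num_early])]
    fused = np.mean(maps, axis=0)   # (N, N); each row still sums to 1
    refined = fused @ values        # aggregate last-layer values
    return fused, refined
```

Because each per-layer map is row-stochastic, the uniform average is also row-stochastic, so the fused map remains a valid attention distribution over patches.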

Cite

Text

Sun et al. "CLIPer: Hierarchically Improving Spatial Representation of CLIP for Open-Vocabulary Semantic Segmentation." International Conference on Computer Vision, 2025.

Markdown

[Sun et al. "CLIPer: Hierarchically Improving Spatial Representation of CLIP for Open-Vocabulary Semantic Segmentation." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/sun2025iccv-cliper/)

BibTeX

@inproceedings{sun2025iccv-cliper,
  title     = {{CLIPer: Hierarchically Improving Spatial Representation of CLIP for Open-Vocabulary Semantic Segmentation}},
  author    = {Sun, Lin and Cao, Jiale and Xie, Jin and Jiang, Xiaoheng and Pang, Yanwei},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {23199--23209},
  url       = {https://mlanthology.org/iccv/2025/sun2025iccv-cliper/}
}