ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference
Abstract
Despite the success of large-scale pretrained Vision-Language Models (VLMs), especially CLIP, in various open-vocabulary tasks, their application to semantic segmentation remains challenging, producing noisy segmentation maps with mis-segmented regions. In this paper, we carefully re-investigate the architecture of CLIP and identify residual connections as the primary source of noise that degrades segmentation quality. Through a comparative analysis of the statistical properties of the residual connection and the attention output across different pretrained models, we discover that CLIP’s image-text contrastive training paradigm emphasizes global features at the expense of local discriminability, leading to noisy segmentation results. In response, we propose ClearCLIP, a novel approach that decomposes CLIP’s representations to enhance open-vocabulary semantic segmentation. We introduce three simple modifications to the final layer: removing the residual connection, adopting self-self attention, and discarding the feed-forward network. ClearCLIP consistently generates clearer and more accurate segmentation maps and outperforms existing approaches across multiple benchmarks, affirming the significance of our discoveries.
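The three final-layer modifications named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a single attention head, omits layer normalization, and takes "self-self attention" to mean query-query attention, a choice the abstract does not specify. The names `standard_last_layer`, `modified_last_layer`, `w_q`, `w_k`, `w_v`, `w_o`, and `ffn` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def standard_last_layer(x, w_q, w_k, w_v, w_o, ffn):
    # Simplified transformer block (layer norm omitted for brevity):
    # query-key attention, residual connection, then a residual FFN.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    h = x + attn @ v @ w_o          # residual connection
    return h + ffn(h)               # feed-forward network

def modified_last_layer(x, w_q, w_v, w_o):
    # Sketch of the three modifications described in the abstract:
    #   1. no residual connection (x is not added back),
    #   2. self-self attention (assumed here to be query-query),
    #   3. no feed-forward network.
    q, v = x @ w_q, x @ w_v
    attn = softmax(q @ q.T / np.sqrt(q.shape[-1]))
    return attn @ v @ w_o

# Toy usage: 5 patch tokens with 8-dimensional features.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
w_q, w_v, w_o = (rng.normal(size=(8, 8)) for _ in range(3))
dense_features = modified_last_layer(x, w_q, w_v, w_o)
```

In a dense-inference setting, per-patch features like `dense_features` would then be compared against text embeddings to label each patch. That step is outside this sketch.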
Cite
Text
Lan et al. "ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-72970-6_9
Markdown
[Lan et al. "ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/lan2024eccv-clearclip/) doi:10.1007/978-3-031-72970-6_9
BibTeX
@inproceedings{lan2024eccv-clearclip,
title = {{ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference}},
author = {Lan, Mengcheng and Chen, Chaofeng and Ke, Yiping and Wang, Xinjiang and Feng, Litong and Zhang, Wayne},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2024},
doi = {10.1007/978-3-031-72970-6_9},
url = {https://mlanthology.org/eccv/2024/lan2024eccv-clearclip/}
}