CRIS: CLIP-Driven Referring Image Segmentation

Abstract

Referring image segmentation aims to segment a referent via a natural language expression. Due to the distinct data properties of text and images, it is challenging for a network to align textual and pixel-level features well. Existing approaches use pretrained models to facilitate learning, yet transfer the language and vision knowledge from those models separately, ignoring the multi-modal correspondence information. Inspired by recent advances in Contrastive Language-Image Pretraining (CLIP), in this paper, we propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS). To transfer the multi-modal knowledge effectively, CRIS resorts to vision-language decoding and contrastive learning to achieve text-to-pixel alignment. More specifically, we design a vision-language decoder to propagate fine-grained semantic information from textual representations to each pixel-level activation, which promotes consistency between the two modalities. In addition, we present text-to-pixel contrastive learning, which explicitly pulls the text feature toward the related pixel-level features and pushes it away from irrelevant ones. Experimental results on three benchmark datasets demonstrate that our proposed framework significantly outperforms previous state-of-the-art methods without any post-processing.
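
The text-to-pixel contrastive objective described above can be read as a per-pixel binary classification: the similarity between the sentence embedding and each pixel embedding should be high inside the ground-truth mask and low outside it. Below is a minimal PyTorch sketch under that reading; the function name, tensor shapes, and the sigmoid/binary cross-entropy formulation are illustrative assumptions, not the authors' released code.

import torch
import torch.nn.functional as F

def text_to_pixel_contrastive_loss(text_feat, pixel_feats, gt_mask):
    # text_feat:   (B, C)       one embedding per referring expression
    # pixel_feats: (B, C, H, W) pixel-level visual features
    # gt_mask:     (B, H, W)    1 on referent pixels, 0 elsewhere
    text_feat = F.normalize(text_feat, dim=-1)
    pixel_feats = F.normalize(pixel_feats, dim=1)
    # Similarity between the text feature and every pixel feature.
    logits = torch.einsum('bc,bchw->bhw', text_feat, pixel_feats)
    # Binary cross-entropy pulls matching pixels toward the text
    # feature and pushes non-matching pixels away from it.
    return F.binary_cross_entropy_with_logits(logits, gt_mask.float())

In practice this loss would be applied to the decoder's output features, with the predicted similarity map itself serving as the segmentation mask at inference time.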

Cite

Text

Wang et al. "CRIS: CLIP-Driven Referring Image Segmentation." Conference on Computer Vision and Pattern Recognition, 2022. doi:10.1109/CVPR52688.2022.01139

Markdown

[Wang et al. "CRIS: CLIP-Driven Referring Image Segmentation." Conference on Computer Vision and Pattern Recognition, 2022.](https://mlanthology.org/cvpr/2022/wang2022cvpr-cris/) doi:10.1109/CVPR52688.2022.01139

BibTeX

@inproceedings{wang2022cvpr-cris,
  title     = {{CRIS: CLIP-Driven Referring Image Segmentation}},
  author    = {Wang, Zhaoqing and Lu, Yu and Li, Qiang and Tao, Xunqiang and Guo, Yandong and Gong, Mingming and Liu, Tongliang},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2022},
  pages     = {11686--11695},
  doi       = {10.1109/CVPR52688.2022.01139},
  url       = {https://mlanthology.org/cvpr/2022/wang2022cvpr-cris/}
}