CLIP as RNN: Segment Countless Visual Concepts Without Training Endeavor
Abstract
Existing open-vocabulary image segmentation methods require a fine-tuning step on mask labels and/or image-text datasets. Mask labels are labor-intensive, which limits the number of categories in segmentation datasets. Consequently, the vocabulary capacity of pre-trained VLMs is severely reduced after fine-tuning. However, without fine-tuning, VLMs trained under weak image-text supervision tend to make suboptimal mask predictions. To alleviate these issues, we introduce a novel recurrent framework that progressively filters out irrelevant texts and enhances mask quality without training efforts. The recurrent unit is a two-stage segmenter built upon a frozen VLM. Thus, our model retains the VLM's broad vocabulary space and equips it with segmentation ability. Experiments show that our method outperforms not only the training-free counterparts but also those fine-tuned with millions of data samples, and sets new state-of-the-art records for both zero-shot semantic and referring segmentation. Concretely, we improve the current record by 28.8, 16.0, and 6.9 mIoU on Pascal VOC, COCO Object, and Pascal Context, respectively.
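The recurrence described above can be summarized as a text-filtering fixed-point loop: segment with the current text queries, score each text, drop the low-confidence ones, and repeat until the query set stops changing. The sketch below is a minimal illustration of that loop, not the paper's implementation; `segment_and_score`, `score_threshold`, and `max_iters` are hypothetical names standing in for the two-stage segmenter built on the frozen VLM and its stopping criteria.

```python
from typing import Callable, Sequence, Tuple

import numpy as np


def recurrent_text_filtering(
    image: np.ndarray,
    texts: Sequence[str],
    segment_and_score: Callable[[np.ndarray, Sequence[str]], Tuple[np.ndarray, np.ndarray]],
    score_threshold: float = 0.5,   # assumed threshold for keeping a text query
    max_iters: int = 10,            # assumed cap on the number of recurrent steps
):
    """Recurrently drop low-confidence text queries and re-segment.

    `segment_and_score` is a user-supplied stand-in for the two-stage segmenter:
    given the image and the current text queries, it returns per-text masks of
    shape (T, H, W) and per-text confidence scores of shape (T,).
    """
    active = list(texts)
    masks = None
    for _ in range(max_iters):
        masks, scores = segment_and_score(image, active)
        keep = [t for t, s in zip(active, scores) if s >= score_threshold]
        if keep == active:   # fixed point reached: no text was filtered out
            break
        active = keep
        if not active:       # every query was judged irrelevant to the image
            break
    return active, masks
```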
Cite
Text
Sun et al. "CLIP as RNN: Segment Countless Visual Concepts Without Training Endeavor." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.01251
Markdown
[Sun et al. "CLIP as RNN: Segment Countless Visual Concepts Without Training Endeavor." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/sun2024cvpr-clip/) doi:10.1109/CVPR52733.2024.01251
BibTeX
@inproceedings{sun2024cvpr-clip,
title = {{CLIP as RNN: Segment Countless Visual Concepts Without Training Endeavor}},
author = {Sun, Shuyang and Li, Runjia and Torr, Philip and Gu, Xiuye and Li, Siyang},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2024},
pages = {13171--13182},
doi = {10.1109/CVPR52733.2024.01251},
url = {https://mlanthology.org/cvpr/2024/sun2024cvpr-clip/}
}