MasQCLIP for Open-Vocabulary Universal Image Segmentation
Abstract
We present a new method for open-vocabulary universal image segmentation, which is capable of performing instance, semantic, and panoptic segmentation under a unified framework. Our approach, called MasQCLIP, seamlessly integrates with a pre-trained CLIP model by utilizing its dense features, thereby circumventing the need for extensive parameter training. MasQCLIP emphasizes two new aspects when building an image segmentation method with a CLIP model: 1) a student-teacher module to deal with masks of the novel (unseen) classes by distilling information from the base (seen) classes; 2) a fine-tuning process to update model parameters for the queries Q within the CLIP model. Thanks to these two simple and intuitive designs, MasQCLIP achieves state-of-the-art performance, outperforming competing methods by a large margin across all three tasks: open-vocabulary instance, semantic, and panoptic segmentation. The project page is at https://masqclip.github.io/.
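To make the second design concrete, here is a minimal PyTorch sketch of what fine-tuning only the query pathway of a frozen CLIP-style attention layer can look like: mask queries attend to dense image features while the pre-trained key, value, and output projections stay frozen. The class name `MasQAttention`, the tensor shapes, and the masking scheme are illustrative assumptions for this sketch, not the authors' implementation.

```python
# Sketch (under assumptions): only the query projection is trainable,
# so the frozen CLIP features are reused while queries adapt to masks.
import torch
import torch.nn as nn


class MasQAttention(nn.Module):
    """Attention block where only the query projection is fine-tuned."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.q_proj = nn.Linear(dim, dim)    # trainable (query fine-tuning)
        self.k_proj = nn.Linear(dim, dim)    # frozen, pre-trained CLIP weights
        self.v_proj = nn.Linear(dim, dim)    # frozen, pre-trained CLIP weights
        self.out_proj = nn.Linear(dim, dim)  # frozen, pre-trained CLIP weights
        for p in (*self.k_proj.parameters(),
                  *self.v_proj.parameters(),
                  *self.out_proj.parameters()):
            p.requires_grad = False          # keep the CLIP backbone intact

    def forward(self, queries, feats, attn_mask=None):
        # queries: (B, N, D) mask tokens; feats: (B, L, D) dense CLIP features
        # attn_mask: optional bool tensor broadcastable to (B, heads, N, L),
        # True where a query is allowed to attend (e.g., inside its mask).
        B, N, D = queries.shape
        h, d = self.num_heads, D // self.num_heads
        q = self.q_proj(queries).view(B, N, h, d).transpose(1, 2)
        k = self.k_proj(feats).view(B, -1, h, d).transpose(1, 2)
        v = self.v_proj(feats).view(B, -1, h, d).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1)) / d ** 0.5
        if attn_mask is not None:
            attn = attn.masked_fill(~attn_mask, float("-inf"))
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, D)
        return self.out_proj(out)
```

In a training loop, only `q_proj` receives gradients, which matches the abstract's claim of avoiding extensive parameter training; the pooled per-query outputs would then be compared against CLIP text embeddings for open-vocabulary classification.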
Cite
Text
Xu et al. "MasQCLIP for Open-Vocabulary Universal Image Segmentation." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.00088
Markdown
[Xu et al. "MasQCLIP for Open-Vocabulary Universal Image Segmentation." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/xu2023iccv-masqclip/) doi:10.1109/ICCV51070.2023.00088
BibTeX
@inproceedings{xu2023iccv-masqclip,
title = {{MasQCLIP for Open-Vocabulary Universal Image Segmentation}},
author = {Xu, Xin and Xiong, Tianyi and Ding, Zheng and Tu, Zhuowen},
booktitle = {International Conference on Computer Vision},
year = {2023},
pages = {887--898},
doi = {10.1109/ICCV51070.2023.00088},
url = {https://mlanthology.org/iccv/2023/xu2023iccv-masqclip/}
}