MasQCLIP for Open-Vocabulary Universal Image Segmentation

Abstract

We present a new method for open-vocabulary universal image segmentation, which is capable of performing instance, semantic, and panoptic segmentation under a unified framework. Our approach, called MasQCLIP, seamlessly integrates with a pre-trained CLIP model by utilizing its dense features, thereby circumventing the need for extensive parameter training. MasQCLIP emphasizes two new aspects when building an image segmentation method with a CLIP model: 1) a student-teacher module that handles masks of the novel (unseen) classes by distilling information from the base (seen) classes; 2) a fine-tuning process that updates the parameters of the queries Q within the CLIP model. Thanks to these two simple and intuitive designs, MasQCLIP achieves state-of-the-art performance, outperforming competing methods by a large margin across all three tasks: open-vocabulary instance, semantic, and panoptic segmentation. The project page is at https://masqclip.github.io/.
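To make the second design point concrete, below is a minimal PyTorch sketch of what fine-tuning only the query projection might look like: mask tokens attend to frozen CLIP patch features, with the key/value projections standing in for pre-trained CLIP weights and kept frozen, while a separate query projection is the only trainable part. This is an illustrative assumption based on the abstract, not the authors' implementation; the class name MaskClassAttention, the layer sizes, and the masking convention are hypothetical.

import torch
import torch.nn as nn

class MaskClassAttention(nn.Module):
    """Illustrative sketch (not the official MasQCLIP code): cross-attention
    where mask tokens use a trainable query projection while the projections
    standing in for pre-trained CLIP weights stay frozen."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        # Stand-ins for pre-trained CLIP projections; frozen during tuning.
        self.k_proj = nn.Linear(dim, dim).requires_grad_(False)
        self.v_proj = nn.Linear(dim, dim).requires_grad_(False)
        self.out_proj = nn.Linear(dim, dim).requires_grad_(False)
        # The only trainable part in this sketch: the query projection
        # applied to the mask tokens (the "queries Q" in the abstract).
        self.mask_q_proj = nn.Linear(dim, dim)

    def forward(self, mask_tokens, patch_tokens, attn_mask=None):
        # mask_tokens: (B, M, D) per-mask tokens; patch_tokens: (B, N, D)
        # dense CLIP patch features.
        B, M, D = mask_tokens.shape
        N = patch_tokens.shape[1]
        h, d = self.num_heads, D // self.num_heads
        q = self.mask_q_proj(mask_tokens).view(B, M, h, d).transpose(1, 2)
        k = self.k_proj(patch_tokens).view(B, N, h, d).transpose(1, 2)
        v = self.v_proj(patch_tokens).view(B, N, h, d).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, h, M, N)
        if attn_mask is not None:
            # attn_mask: (B, M, N) boolean, True where a patch falls inside
            # the proposal's mask region (assumes every mask covers at
            # least one patch, otherwise softmax would produce NaNs).
            attn = attn.masked_fill(~attn_mask.unsqueeze(1), float("-inf"))
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, M, D)
        return self.out_proj(out)

# Usage with hypothetical sizes: 100 mask tokens over 14x14 patch features.
layer = MaskClassAttention(dim=512)
masks = torch.randn(2, 100, 512)
patches = torch.randn(2, 196, 512)
out = layer(masks, patches)  # (2, 100, 512)

Under this reading, only mask_q_proj receives gradients, so the pre-trained CLIP representation is left intact while the queries adapt to the segmentation task.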

Cite

Text

Xu et al. "MasQCLIP for Open-Vocabulary Universal Image Segmentation." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.00088

Markdown

[Xu et al. "MasQCLIP for Open-Vocabulary Universal Image Segmentation." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/xu2023iccv-masqclip/) doi:10.1109/ICCV51070.2023.00088

BibTeX

@inproceedings{xu2023iccv-masqclip,
  title     = {{MasQCLIP for Open-Vocabulary Universal Image Segmentation}},
  author    = {Xu, Xin and Xiong, Tianyi and Ding, Zheng and Tu, Zhuowen},
  booktitle = {International Conference on Computer Vision},
  year      = {2023},
  pages     = {887--898},
  doi       = {10.1109/ICCV51070.2023.00088},
  url       = {https://mlanthology.org/iccv/2023/xu2023iccv-masqclip/}
}