Diffusion Models for Open-Vocabulary Segmentation
Abstract
Open-vocabulary segmentation is the task of segmenting anything that can be named in an image. Recently, large-scale vision-language modelling has led to significant advances in open-vocabulary segmentation, but at the cost of gargantuan and increasing training and annotation efforts. Hence, we ask if it is possible to use existing foundation models to synthesise on-demand efficient segmentation algorithms for specific class sets, making them applicable in an open-vocabulary setting without the need to collect further data or annotations, or to perform any training. To that end, we present a novel method that leverages generative text-to-image diffusion models for unsupervised open-vocabulary segmentation. Our method synthesises support image sets for arbitrary textual categories, creating for each a set of prototypes representative of both the category and its surrounding context (background). It relies solely on pre-trained components and outputs the synthesised segmenter directly, without training. Our approach shows strong performance on a range of benchmarks, obtaining a lead of more than 5% over prior work on PASCAL VOC.
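The prototype-based segmentation step described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes support-image features and foreground/background masks have already been obtained (in the actual method these come from a text-to-image diffusion model and pre-trained feature extractors, which are stubbed out here), and the function names `build_prototypes` and `segment` are hypothetical.

```python
import numpy as np

def build_prototypes(support_feats, support_masks):
    """Average support-image features into one category and one
    background prototype.

    support_feats: (N, H, W, D) per-pixel features of synthesised
                   support images (assumed precomputed).
    support_masks: (N, H, W) binary masks separating the category (1)
                   from its surrounding context / background (0).
    """
    keep = support_masks.astype(bool)
    fg_proto = support_feats[keep].mean(axis=0)    # category prototype
    bg_proto = support_feats[~keep].mean(axis=0)   # background prototype
    return fg_proto, bg_proto

def segment(image_feats, fg_proto, bg_proto):
    """Label each pixel of a query image by comparing its feature to the
    two prototypes with cosine similarity (no training involved)."""
    def cos(feats, proto):
        feats = feats / np.linalg.norm(feats, axis=-1, keepdims=True)
        proto = proto / np.linalg.norm(proto)
        return feats @ proto
    # 1 where the pixel is closer to the category prototype, else 0
    return (cos(image_feats, fg_proto) > cos(image_feats, bg_proto)).astype(np.uint8)
```

In use, one would synthesise a handful of support images per textual category, extract features and masks for them, build the prototypes once, and then apply `segment` to any query image's feature map; multiple prototypes per category (rather than a single mean) would be a straightforward extension.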
Cite
Text
Karazija et al. "Diffusion Models for Open-Vocabulary Segmentation." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-72652-1_18
Markdown
[Karazija et al. "Diffusion Models for Open-Vocabulary Segmentation." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/karazija2024eccv-diffusion/) doi:10.1007/978-3-031-72652-1_18
BibTeX
@inproceedings{karazija2024eccv-diffusion,
title = {{Diffusion Models for Open-Vocabulary Segmentation}},
author = {Karazija, Laurynas and Laina, Iro and Vedaldi, Andrea and Rupprecht, Christian},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2024},
doi = {10.1007/978-3-031-72652-1_18},
url = {https://mlanthology.org/eccv/2024/karazija2024eccv-diffusion/}
}