Segment, Select, Correct: A Framework for Weakly-Supervised Referring Segmentation

Abstract

Referring Image Segmentation (RIS) – the problem of identifying objects in images through natural language sentences – is a challenging task that is currently solved mostly through supervised learning. However, while collecting referred annotation masks is a time-consuming process, the few existing weakly-supervised and zero-shot approaches fall significantly short of fully-supervised ones in performance. To bridge the performance gap without mask annotations, we propose a novel weakly-supervised framework that tackles RIS by decomposing it into three steps: obtaining instance masks for the object mentioned in the referencing instruction (segment), using zero-shot learning to select a potentially correct mask for the given instruction (select), and bootstrapping a model which allows for fixing the mistakes of zero-shot selection (correct). In our experiments, using only the first two steps (zero-shot segment and select) outperforms other zero-shot baselines by as much as 16.5%, while our full method improves upon this much stronger baseline and sets the new state-of-the-art for weakly-supervised RIS, reducing the gap between the weakly-supervised and fully-supervised methods in some cases from around 33% to as little as 7%. Code is available at https://github.com/fgirbal/segment-select-correct.
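The first two steps of the decomposition can be pictured with a minimal Python sketch. This is only an illustration under stated assumptions, not the authors' implementation: `segment_fn`, `score_fn`, `select_mask`, and `pseudo_label` are hypothetical names, the masks are toy nested lists, and the scoring rule is a placeholder for whatever zero-shot image-text matching is actually used. The third step (correct), which bootstraps a model from the selected masks, is not shown.

```python
from typing import Callable, List, Sequence

# Toy representation (assumption): a binary mask as nested lists of 0/1.
Mask = List[List[int]]


def select_mask(scores: Sequence[float]) -> int:
    """Return the index of the highest-scoring candidate mask."""
    if not scores:
        raise ValueError("need at least one candidate mask")
    return max(range(len(scores)), key=lambda i: scores[i])


def pseudo_label(
    image: object,
    sentence: str,
    segment_fn: Callable[[object, str], List[Mask]],
    score_fn: Callable[[object, Mask, str], float],
) -> Mask:
    """Segment, then select: pick one candidate mask for the sentence."""
    # Segment: candidate instance masks for the object named in the sentence.
    candidates = segment_fn(image, sentence)
    # Select: score each candidate against the sentence and keep the best one.
    scores = [score_fn(image, mask, sentence) for mask in candidates]
    return candidates[select_mask(scores)]


# Hypothetical stand-ins so the sketch runs without any model.
def toy_segment(image: object, sentence: str) -> List[Mask]:
    return [
        [[1, 0], [0, 0]],  # small candidate
        [[1, 1], [1, 0]],  # large candidate
    ]


def toy_score(image: object, mask: Mask, sentence: str) -> float:
    # Placeholder score: mask area. A real system would compare
    # the masked image region with the sentence instead.
    return float(sum(sum(row) for row in mask))


chosen = pseudo_label(None, "the large object", toy_segment, toy_score)
```

In this sketch the selected mask plays the role of a pseudo-label; in the framework described above, such labels would then serve as training signal for the correction stage.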

Cite

Text

Eiras et al. "Segment, Select, Correct: A Framework for Weakly-Supervised Referring Segmentation." European Conference on Computer Vision Workshops, 2024. doi:10.1007/978-3-031-91856-8_19

Markdown

[Eiras et al. "Segment, Select, Correct: A Framework for Weakly-Supervised Referring Segmentation." European Conference on Computer Vision Workshops, 2024.](https://mlanthology.org/eccvw/2024/eiras2024eccvw-segment/) doi:10.1007/978-3-031-91856-8_19

BibTeX

@inproceedings{eiras2024eccvw-segment,
  title     = {{Segment, Select, Correct: A Framework for Weakly-Supervised Referring Segmentation}},
  author    = {Eiras, Francisco and Oksuz, Kemal and Bibi, Adel and Torr, Philip H. S. and Dokania, Puneet K.},
  booktitle = {European Conference on Computer Vision Workshops},
  year      = {2024},
  pages     = {326--342},
  doi       = {10.1007/978-3-031-91856-8_19},
  url       = {https://mlanthology.org/eccvw/2024/eiras2024eccvw-segment/}
}