Weakly Supervised Referring Image Segmentation with Intra-Chunk and Inter-Chunk Consistency
Abstract
Referring image segmentation (RIS) aims to localize the object in an image referred to by a natural language expression. Most previous studies learn RIS from large-scale datasets with segmentation labels, which are costly to obtain. We present a weakly supervised learning method for RIS that only uses readily available image-text pairs. We first train a visual-linguistic model for image-text matching and extract a visual saliency map through Grad-CAM to identify the image regions corresponding to each word. However, we found two major problems with Grad-CAM. First, it lacks consideration of critical semantic relationships between words. We tackle this problem by modeling the relationship between words through intra-chunk and inter-chunk consistency. Second, Grad-CAM identifies only small regions of the referred object, leading to low recall. Therefore, we refine the localization maps with self-attention in the Transformer and an unsupervised object shape prior. On three popular benchmarks (RefCOCO, RefCOCO+, G-Ref), our method significantly outperforms recent comparable techniques. We also show that our method is applicable to various levels of supervision and obtains better performance than recent methods.
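As a rough illustration of the Grad-CAM step mentioned in the abstract, the sketch below computes a saliency map from a layer's activations and the gradients of a score with respect to them. It shows only the generic Grad-CAM formulation and is not the authors' pipeline: the function name `grad_cam`, the `(C, H, W)` array layout, and the final normalization are assumptions made for this example, and the intra-chunk and inter-chunk consistency terms and the refinement steps are not modeled.

```python
import numpy as np


def grad_cam(activations, gradients):
    """Generic Grad-CAM map (hypothetical helper, not the paper's code).

    activations: array of shape (C, H, W), feature maps from one layer.
    gradients:   array of shape (C, H, W), gradient of a scalar score
                 (for example an image-text matching score) w.r.t. them.
    """
    # Channel weights: average each channel's gradient over space.
    weights = gradients.mean(axis=(1, 2))               # shape (C,)
    # Weighted sum of the activation maps over channels.
    cam = np.tensordot(weights, activations, axes=1)    # shape (H, W)
    # Keep only positive evidence for the score.
    cam = np.maximum(cam, 0.0)
    # Scale to [0, 1] when the map is not all zeros (assumed convention).
    peak = cam.max()
    return cam / peak if peak > 0 else cam


# Toy input: channel 0 supports the score, channel 1 opposes it.
acts = np.array([[[1.0, 0.0], [0.0, 0.0]],
                 [[0.0, 0.0], [0.0, 2.0]]])
grads = np.stack([np.ones((2, 2)), -np.ones((2, 2))])
toy_map = grad_cam(acts, grads)
```

In this toy case only the top-left cell remains active after the ReLU, which loosely mirrors the low-recall behavior the abstract describes: the map highlights a small discriminative region rather than the whole object.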
Cite
Text
Lee et al. "Weakly Supervised Referring Image Segmentation with Intra-Chunk and Inter-Chunk Consistency." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.01999
Markdown
[Lee et al. "Weakly Supervised Referring Image Segmentation with Intra-Chunk and Inter-Chunk Consistency." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/lee2023iccv-weakly/) doi:10.1109/ICCV51070.2023.01999
BibTeX
@inproceedings{lee2023iccv-weakly,
title = {{Weakly Supervised Referring Image Segmentation with Intra-Chunk and Inter-Chunk Consistency}},
author = {Lee, Jungbeom and Lee, Sungjin and Nam, Jinseok and Yu, Seunghak and Do, Jaeyoung and Taghavi, Tara},
booktitle = {International Conference on Computer Vision},
year = {2023},
pages = {21870-21881},
doi = {10.1109/ICCV51070.2023.01999},
url = {https://mlanthology.org/iccv/2023/lee2023iccv-weakly/}
}