Locate Then Segment: A Strong Pipeline for Referring Image Segmentation
Abstract
Referring image segmentation aims to segment the objects referred to by a natural language expression. Previous methods usually focus on designing an implicit and recurrent feature interaction mechanism that fuses the visual-linguistic features and directly generates the final segmentation mask, without explicitly modeling the localization of the referent guided by the language expression and without designing a powerful segmentation module. To tackle these problems, we view this task from another perspective by decoupling it into a "locate-then-segment" (LTS) scheme. Given a language expression, people generally first attend to the corresponding target image regions, then generate a segmentation mask of the object based on its context. The LTS first extracts and fuses both visual and textual features to get a cross-modal representation, then applies a cross-modal interaction on the visual-textual features to locate the referred object with a position prior, and finally generates the segmentation result with a light-weight network. Our LTS is simple but surprisingly effective. On three popular benchmark datasets, the LTS outperforms all previous state-of-the-art methods by a large margin (e.g., +3.2% on RefCOCO+ and +3.4% on RefCOCOg). In addition, our model is more interpretable because it explicitly locates the object, which is also supported by visualization experiments. Accordingly, this framework is very promising to serve as a pipeline for referring image segmentation.
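The three stages named in the abstract (fuse, locate, segment) can be illustrated with a toy sketch. This is not the authors' architecture: the element-wise product fusion, the softmax position prior, and the thresholding "segmentation head" are all assumptions chosen for brevity, and every function name here (`fuse`, `locate`, `segment`, `locate_then_segment`) is hypothetical.

```python
import math

def fuse(visual, text):
    """Fuse each pixel's visual feature with the sentence feature.
    Element-wise product is an assumption, not the paper's fusion."""
    return [[[v * t for v, t in zip(pixel, text)] for pixel in row] for row in visual]

def locate(fused):
    """Return a position prior: a softmax over all pixels of the summed
    fused response (a stand-in for the paper's cross-modal interaction)."""
    scores = [[sum(pixel) for pixel in row] for row in fused]
    peak = max(max(row) for row in scores)
    exps = [[math.exp(s - peak) for s in row] for row in scores]  # numerically stable
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]

def segment(prior, ratio=0.5):
    """Toy segmentation step: keep pixels whose prior is at least
    `ratio` of the peak (the paper uses a light-weight network instead)."""
    peak = max(max(row) for row in prior)
    return [[1 if p >= ratio * peak else 0 for p in row] for row in prior]

def locate_then_segment(visual, text):
    """Run the decoupled pipeline: fuse, then locate, then segment."""
    prior = locate(fuse(visual, text))
    return prior, segment(prior)

# A 2x2 "image" with 2 channels per pixel and a 2-dimensional sentence feature.
visual = [
    [[0.1, 0.0], [0.2, 0.1]],
    [[0.0, 0.3], [2.0, 1.5]],
]
text = [1.0, 1.0]
prior, mask = locate_then_segment(visual, text)
```

The point of the decoupling is that `prior` is an inspectable intermediate: it shows where the model believes the referent is before any mask is produced, which is the interpretability benefit the abstract describes.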
Cite
Text
Jing et al. "Locate Then Segment: A Strong Pipeline for Referring Image Segmentation." Conference on Computer Vision and Pattern Recognition, 2021. doi:10.1109/CVPR46437.2021.00973
Markdown
[Jing et al. "Locate Then Segment: A Strong Pipeline for Referring Image Segmentation." Conference on Computer Vision and Pattern Recognition, 2021.](https://mlanthology.org/cvpr/2021/jing2021cvpr-locate/) doi:10.1109/CVPR46437.2021.00973
BibTeX
@inproceedings{jing2021cvpr-locate,
title = {{Locate Then Segment: A Strong Pipeline for Referring Image Segmentation}},
author = {Jing, Ya and Kong, Tao and Wang, Wei and Wang, Liang and Li, Lei and Tan, Tieniu},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2021},
pages = {9858--9867},
doi = {10.1109/CVPR46437.2021.00973},
url = {https://mlanthology.org/cvpr/2021/jing2021cvpr-locate/}
}