SLViT: Scale-Wise Language-Guided Vision Transformer for Referring Image Segmentation

Ouyang, Shuyi; Wang, Hongyi; Xie, Shiao; Niu, Ziwei; Tong, Ruofeng; Chen, Yen-Wei; Lin, Lanfen

doi:10.24963/IJCAI.2023/144

IJCAI 2023 pp. 1294-1302

Abstract

Referring image segmentation aims to segment an object out of an image via a specific language expression. The main concept is establishing global visual-linguistic relationships to locate the object and identify boundaries using details of the image. Recently, various Transformer-based techniques have been proposed to efficiently leverage long-range cross-modal dependencies, enhancing performance for referring segmentation. However, existing methods consider visual feature extraction and cross-modal fusion separately, resulting in insufficient visual-linguistic alignment in semantic space. In addition, they employ sequential structures and hence lack multi-scale information interaction. To address these limitations, we propose a Scale-Wise Language-Guided Vision Transformer (SLViT) with two appealing designs: (1) Language-Guided Multi-Scale Fusion Attention, a novel attention mechanism module for extracting rich local visual information and modeling global visual-linguistic relationships in an integrated manner. (2) An Uncertain Region Cross-Scale Enhancement module that can identify regions of high uncertainty using linguistic features and refine them via aggregated multi-scale features. We have evaluated our method on three benchmark datasets. The experimental results demonstrate that SLViT surpasses state-of-the-art methods with lower computational cost. The code is publicly available at: https://github.com/NaturalKnight/SLViT.
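As a rough illustration of what "modeling global visual-linguistic relationships" with attention can look like, the sketch below lets every visual token attend over all word tokens using plain scaled dot-product attention. It is not the paper's Language-Guided Multi-Scale Fusion Attention: the function names, the toy 2-dimensional features, the residual fusion, and the omission of learned projections are all assumptions made for illustration. The authors' implementation is in the repository linked above.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(visual_tokens, word_tokens):
    """Each visual token (query) attends over all word tokens (keys and values).

    Hypothetical helper: generic scaled dot-product attention without learned
    projections. Returns (fused_tokens, attention_weights).
    """
    dim = len(word_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    fused, weights = [], []
    for v in visual_tokens:
        # Similarity between this visual token and every word token.
        scores = [scale * sum(a * b for a, b in zip(v, w)) for w in word_tokens]
        attn = softmax(scores)
        # Language context: attention-weighted sum of the word tokens.
        context = [sum(p * w[k] for p, w in zip(attn, word_tokens)) for k in range(dim)]
        # Residual fusion (an assumption): add the context to the visual token.
        fused.append([x + c for x, c in zip(v, context)])
        weights.append(attn)
    return fused, weights


# Toy inputs: three flattened visual tokens and two word tokens.
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
words = [[2.0, 0.0], [0.0, 2.0]]
fused, weights = cross_modal_attention(visual, words)
```

In this toy setup the first visual token puts more weight on the first word and the third token splits its weight evenly, which is the kind of per-location language grounding that a cross-modal attention layer provides.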

Cite

Text

Ouyang et al. "SLViT: Scale-Wise Language-Guided Vision Transformer for Referring Image Segmentation." International Joint Conference on Artificial Intelligence, 2023. doi:10.24963/IJCAI.2023/144

Markdown

[Ouyang et al. "SLViT: Scale-Wise Language-Guided Vision Transformer for Referring Image Segmentation." International Joint Conference on Artificial Intelligence, 2023.](https://mlanthology.org/ijcai/2023/ouyang2023ijcai-slvit/) doi:10.24963/IJCAI.2023/144

BibTeX

@inproceedings{ouyang2023ijcai-slvit,
  title     = {{SLViT: Scale-Wise Language-Guided Vision Transformer for Referring Image Segmentation}},
  author    = {Ouyang, Shuyi and Wang, Hongyi and Xie, Shiao and Niu, Ziwei and Tong, Ruofeng and Chen, Yen-Wei and Lin, Lanfen},
  booktitle = {International Joint Conference on Artificial Intelligence},
  year      = {2023},
  pages     = {1294-1302},
  doi       = {10.24963/IJCAI.2023/144},
  url       = {https://mlanthology.org/ijcai/2023/ouyang2023ijcai-slvit/}
}