SLAN: Self-Locator Aided Network for Vision-Language Understanding

Jiang-Tian Zhai, Qi Zhang, Tong Wu, Xing-Yu Chen, Jiang-Jiang Liu, Ming-Ming Cheng

ICCV 2023 pp. 21949-21958

doi:10.1109/ICCV51070.2023.02006

Abstract

Learning the fine-grained interplay between vision and language contributes to a more accurate understanding in vision-language tasks. However, it remains challenging to extract key image regions according to the text for semantic alignment. Most existing works either are limited by text-agnostic, redundant regions obtained from frozen detectors, or fail to scale further because they rely heavily on scarce grounding (gold) data to pre-train detectors. To solve these problems, we propose the Self-Locator Aided Network (SLAN) for vision-language understanding tasks without any extra gold data. SLAN consists of a region filter and a region adaptor that localize regions of interest conditioned on different texts. By aggregating vision-language information, the region filter selects key regions and the region adaptor updates their coordinates with text guidance. With detailed region-word alignments, SLAN generalizes easily to many downstream tasks. It achieves competitive results on five vision-language understanding tasks (e.g., 85.7% and 69.2% on COCO image-to-text and text-to-image retrieval, surpassing previous SOTA methods). SLAN also demonstrates strong zero-shot and fine-tuned transferability to two localization tasks.
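The two-stage idea in the abstract (select text-relevant regions, then adjust their coordinates) can be illustrated with a minimal sketch. This is not the authors' implementation: the cosine-similarity scoring, the top-k selection, the (cx, cy, w, h) box format, and the offset parameterization are all assumptions chosen for illustration, and the function names `filter_regions` and `adapt_box` are hypothetical. In a real model the offsets would be predicted from fused vision-language features; here they are passed in by hand.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def filter_regions(region_feats, text_feat, k):
    """Hypothetical region filter: keep the k regions most similar to the text.

    Returns (indices of kept regions, best first; similarity score per region).
    """
    scores = [cosine(f, text_feat) for f in region_feats]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k], scores


def adapt_box(box, offsets):
    """Hypothetical region adaptor: refine a (cx, cy, w, h) box with offsets.

    Assumed parameterization: the center shifts by a fraction of the box size,
    and width/height are rescaled by exp(dw) and exp(dh).
    """
    cx, cy, w, h = box
    dx, dy, dw, dh = offsets
    return (cx + dx * w, cy + dy * h, w * math.exp(dw), h * math.exp(dh))


# Toy example: three candidate regions with 2-D features and one text feature.
region_feats = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
boxes = [(0.2, 0.2, 0.1, 0.1), (0.5, 0.5, 0.2, 0.2), (0.8, 0.8, 0.1, 0.1)]
text_feat = [0.0, 1.0]

keep, scores = filter_regions(region_feats, text_feat, k=2)

# Offsets are supplied by hand here; a model would predict them per region.
refined = [adapt_box(boxes[i], (0.1, 0.0, 0.0, 0.0)) for i in keep]
```

In this toy setup the second and third regions are kept because their features point closest to the text feature, and each kept box is shifted right by a tenth of its width.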


Cite

Text

Zhai et al. "SLAN: Self-Locator Aided Network for Vision-Language Understanding." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.02006

Markdown

[Zhai et al. "SLAN: Self-Locator Aided Network for Vision-Language Understanding." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/zhai2023iccv-slan/) doi:10.1109/ICCV51070.2023.02006

BibTeX

@inproceedings{zhai2023iccv-slan,
  title     = {{SLAN: Self-Locator Aided Network for Vision-Language Understanding}},
  author    = {Zhai, Jiang-Tian and Zhang, Qi and Wu, Tong and Chen, Xing-Yu and Liu, Jiang-Jiang and Cheng, Ming-Ming},
  booktitle = {International Conference on Computer Vision},
  year      = {2023},
  pages     = {21949--21958},
  doi       = {10.1109/ICCV51070.2023.02006},
  url       = {https://mlanthology.org/iccv/2023/zhai2023iccv-slan/}
}