ScanFormer: Referring Expression Comprehension by Iteratively Scanning

Abstract

Referring Expression Comprehension (REC) aims to localize the target objects specified by free-form natural language descriptions in images. While state-of-the-art methods achieve impressive performance, they perform a dense perception of images, which incorporates redundant visual regions unrelated to linguistic queries, leading to additional computational overhead. This inspires us to explore a question: can we eliminate linguistically irrelevant, redundant visual regions to improve the efficiency of the model? Existing relevant methods primarily focus on fundamental visual tasks, with limited exploration in vision-language fields. To address this, we propose a coarse-to-fine iterative perception framework called ScanFormer. It iteratively exploits the image scale pyramid to extract linguistically relevant visual patches from top to bottom. In each iteration, irrelevant patches are discarded by our designed informativeness prediction. Furthermore, we propose a patch selection strategy for discarded patches to accelerate inference. Experiments on widely used datasets, namely RefCOCO, RefCOCO+, RefCOCOg, and ReferItGame, verify the effectiveness of our method, which can strike a balance between accuracy and efficiency.
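
The coarse-to-fine scanning idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the quadtree-style 2x2 refinement, the function names `coarse_to_fine_scan` and `overlaps_box`, and the box-overlap test standing in for the learned informativeness prediction are all assumptions made for illustration. It also leaves out the patch selection strategy for discarded patches.

```python
from typing import Callable, List, Set, Tuple

Patch = Tuple[int, int]  # (row, col) index within one pyramid level's patch grid


def coarse_to_fine_scan(
    coarse_grid: Tuple[int, int],
    num_levels: int,
    is_informative: Callable[[int, int, int], bool],
) -> List[Set[Patch]]:
    """Return the set of patches visited at each level, coarsest first.

    Assumption: each level doubles the grid resolution, so a kept patch
    (r, c) expands into its four children at the next, finer level.
    """
    rows, cols = coarse_grid
    active: Set[Patch] = {(r, c) for r in range(rows) for c in range(cols)}
    visited: List[Set[Patch]] = []
    for level in range(num_levels):
        visited.append(active)
        if level == num_levels - 1:
            break  # finest level reached, nothing left to refine
        # Discard patches judged irrelevant to the query at this level.
        kept = {p for p in active if is_informative(level, p[0], p[1])}
        # Refine only the kept patches into their 2x2 children.
        active = {
            (2 * r + dr, 2 * c + dc)
            for (r, c) in kept
            for dr in (0, 1)
            for dc in (0, 1)
        }
    return visited


def overlaps_box(
    box: Tuple[float, float, float, float], coarse_grid: Tuple[int, int]
) -> Callable[[int, int, int], bool]:
    """Hypothetical stand-in for a learned informativeness predictor.

    A patch counts as informative if it overlaps `box`, given as
    (x0, y0, x1, y1) in normalized image coordinates.
    """
    x0, y0, x1, y1 = box

    def predicate(level: int, row: int, col: int) -> bool:
        n_rows = coarse_grid[0] * 2 ** level
        n_cols = coarse_grid[1] * 2 ** level
        top, bottom = row / n_rows, (row + 1) / n_rows
        left, right = col / n_cols, (col + 1) / n_cols
        return left < x1 and x0 < right and top < y1 and y0 < bottom

    return predicate


grid = (2, 2)
levels = coarse_to_fine_scan(grid, 3, overlaps_box((0.6, 0.6, 0.9, 0.9), grid))
scanned = sum(len(s) for s in levels)
dense = sum(grid[0] * grid[1] * 4 ** k for k in range(3))
print(f"patches scanned: {scanned} of {dense} in a dense pyramid")
```

In this toy setting, patches far from the hypothetical target box are dropped at the coarsest level and never refined, which is the source of the savings over dense perception.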

Cite

Text

Su et al. "ScanFormer: Referring Expression Comprehension by Iteratively Scanning." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.01277

Markdown

[Su et al. "ScanFormer: Referring Expression Comprehension by Iteratively Scanning." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/su2024cvpr-scanformer/) doi:10.1109/CVPR52733.2024.01277

BibTeX

@inproceedings{su2024cvpr-scanformer,
  title     = {{ScanFormer: Referring Expression Comprehension by Iteratively Scanning}},
  author    = {Su, Wei and Miao, Peihan and Dou, Huanzhang and Li, Xi},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {13449-13458},
  doi       = {10.1109/CVPR52733.2024.01277},
  url       = {https://mlanthology.org/cvpr/2024/su2024cvpr-scanformer/}
}