ScanFormer: Referring Expression Comprehension by Iteratively Scanning
Abstract
Referring Expression Comprehension (REC) aims to localize the target objects specified by free-form natural language descriptions in images. While state-of-the-art methods achieve impressive performance, they perform a dense perception of images, which incorporates redundant visual regions unrelated to linguistic queries, leading to additional computational overhead. This inspires us to explore a question: can we eliminate linguistic-irrelevant redundant visual regions to improve the efficiency of the model? Existing relevant methods primarily focus on fundamental visual tasks, with limited exploration in vision-language fields. To address this, we propose a coarse-to-fine iterative perception framework called ScanFormer. It iteratively exploits the image scale pyramid to extract linguistic-relevant visual patches from top to bottom. In each iteration, irrelevant patches are discarded by our designed informativeness prediction. Furthermore, we propose a patch selection strategy for discarded patches to accelerate inference. Experiments on widely used datasets, namely RefCOCO, RefCOCO+, RefCOCOg, and ReferItGame, verify the effectiveness of our method, which can strike a balance between accuracy and efficiency.
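The coarse-to-fine scanning idea can be illustrated with a toy sketch. This is not the authors' implementation: the quadtree-style 2x2 refinement, the function names (`scan_coarse_to_fine`, `make_box_score`), the fixed threshold, and the box-overlap score standing in for a learned, language-conditioned informativeness predictor are all assumptions made for illustration. The patch selection strategy for discarded patches mentioned in the abstract is not modeled here.

```python
from typing import Callable, List, Tuple

# A patch is (level, row, col); level 0 is the coarsest scale of the pyramid.
Patch = Tuple[int, int, int]


def scan_coarse_to_fine(
    num_levels: int,
    base_grid: int,
    informativeness: Callable[[Patch], float],
    threshold: float = 0.5,
) -> List[List[Patch]]:
    """Return the patches visited at each pyramid level.

    Hypothetical sketch: a patch scoring below `threshold` is discarded,
    so none of its four finer-scale children are visited.
    """
    # Every patch of the coarsest level is visited.
    current = [(0, r, c) for r in range(base_grid) for c in range(base_grid)]
    visited = [current]
    for level in range(1, num_levels):
        finer = []
        for (_, r, c) in current:
            if informativeness((level - 1, r, c)) < threshold:
                continue  # discarded: no refinement below this patch
            # Each kept patch maps to a 2x2 block at the next finer scale.
            finer.extend(
                (level, 2 * r + dr, 2 * c + dc)
                for dr in (0, 1)
                for dc in (0, 1)
            )
        visited.append(finer)
        current = finer
    return visited


def make_box_score(box, base_grid):
    """Toy stand-in for a learned predictor: 1.0 if the patch overlaps `box`.

    `box` is (x0, y0, x1, y1) in normalized [0, 1] image coordinates.
    """
    x0, y0, x1, y1 = box

    def score(patch: Patch) -> float:
        level, r, c = patch
        size = 1.0 / (base_grid * 2 ** level)  # patch side length
        px0, py0 = c * size, r * size
        overlaps = px0 < x1 and px0 + size > x0 and py0 < y1 and py0 + size > y0
        return 1.0 if overlaps else 0.0

    return score


# Three scales, 2x2 coarsest grid, small target near the top-left corner.
levels = scan_coarse_to_fine(3, 2, make_box_score((0.1, 0.1, 0.2, 0.2), 2))
counts = [len(patches) for patches in levels]
# Patch counts a dense scan would visit at the same three scales.
dense = [(2 * 2 ** level) ** 2 for level in range(3)]
```

In this toy setting the scan visits 4 patches at each of the three scales (12 in total), whereas a dense scan of the same pyramid would visit 4 + 16 + 64 = 84. The saving depends entirely on how many patches the score function discards at the coarse levels.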
Cite
Text
Su et al. "ScanFormer: Referring Expression Comprehension by Iteratively Scanning." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.01277
Markdown
[Su et al. "ScanFormer: Referring Expression Comprehension by Iteratively Scanning." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/su2024cvpr-scanformer/) doi:10.1109/CVPR52733.2024.01277
BibTeX
@inproceedings{su2024cvpr-scanformer,
title = {{ScanFormer: Referring Expression Comprehension by Iteratively Scanning}},
author = {Su, Wei and Miao, Peihan and Dou, Huanzhang and Li, Xi},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2024},
pages = {13449-13458},
doi = {10.1109/CVPR52733.2024.01277},
url = {https://mlanthology.org/cvpr/2024/su2024cvpr-scanformer/}
}