When Visual Grounding Meets Gigapixel-Level Large-Scale Scenes: Benchmark and Approach
Abstract
Visual grounding refers to the process of associating natural language expressions with corresponding regions within an image. Existing benchmarks for visual grounding primarily operate within small-scale scenes containing a few objects. Nevertheless, recent advances in imaging technology have enabled the acquisition of gigapixel-level images, providing high-resolution details of large-scale scenes containing numerous objects. To bridge this gap between imaging and computer vision benchmarks, and to make grounding more practically valuable, we introduce a novel dataset named GigaGrounding, designed to challenge visual grounding models in gigapixel-level large-scale scenes. We extensively analyze and compare the dataset with existing benchmarks, demonstrating that GigaGrounding presents unique challenges, such as large-scale scene understanding, gigapixel-level resolution, significant variations in object scales, and "multi-hop expressions". Furthermore, we introduce a simple yet effective grounding approach, which employs a "glance-to-zoom-in" paradigm and exhibits enhanced capabilities for addressing the GigaGrounding task. The dataset is available at www.gigavision.ai.
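The abstract names the "glance-to-zoom-in" paradigm but does not detail it. The sketch below is a hypothetical illustration of such a two-stage pipeline, not the authors' implementation: a coarse glance over tiles of the scene proposes promising regions, and a zoom-in pass searches each proposal at finer granularity. All names (Box, score_region, glance, zoom_in, ground), the tiling strategy, and the dummy brightness-based scorer are assumptions for illustration only; a real system would score crops against the expression with a vision-language model.

"""Minimal sketch of a "glance-to-zoom-in" grounding pipeline.

NOT the paper's method; the paradigm is only named in the abstract.
Stage 1 (glance) tiles the scene and keeps the top-scoring tiles;
stage 2 (zoom-in) searches each kept tile at finer granularity.
"""
from dataclasses import dataclass
import numpy as np

@dataclass
class Box:
    x0: int  # pixel coordinates in the full image
    y0: int
    x1: int
    y1: int

def score_region(crop: np.ndarray, expression: str) -> float:
    """Placeholder cross-modal scorer (hypothetical). A real system would
    embed the crop and the expression with a vision-language model; here
    we simply return mean brightness as a stand-in score."""
    return float(crop.mean())

def glance(image: np.ndarray, expression: str, stride: int = 1024,
           top_k: int = 3) -> list[Box]:
    """Stage 1: tile the scene coarsely and keep the top-k tiles."""
    h, w = image.shape[:2]
    scored = []
    for y0 in range(0, h, stride):
        for x0 in range(0, w, stride):
            box = Box(x0, y0, min(x0 + stride, w), min(y0 + stride, h))
            crop = image[box.y0:box.y1, box.x0:box.x1]
            scored.append((score_region(crop, expression), box))
    scored.sort(key=lambda s: s[0], reverse=True)
    return [box for _, box in scored[:top_k]]

def zoom_in(image: np.ndarray, box: Box, expression: str) -> tuple[float, Box]:
    """Stage 2: subdivide one proposal and localize the best sub-region."""
    best = (float("-inf"), box)
    step = max((box.x1 - box.x0) // 4, 1)
    for y0 in range(box.y0, box.y1, step):
        for x0 in range(box.x0, box.x1, step):
            sub = Box(x0, y0, min(x0 + step, box.x1), min(y0 + step, box.y1))
            s = score_region(image[sub.y0:sub.y1, sub.x0:sub.x1], expression)
            if s > best[0]:
                best = (s, sub)
    return best

def ground(image: np.ndarray, expression: str) -> Box:
    """Glance over the whole scene, then zoom into the best proposals."""
    candidates = [zoom_in(image, b, expression) for b in glance(image, expression)]
    return max(candidates, key=lambda c: c[0])[1]

if __name__ == "__main__":
    scene = np.random.rand(2048, 2048)  # small stand-in for a gigapixel scene
    print(ground(scene, "the person in red near the fountain"))

The point being illustrated: a gigapixel image cannot be processed at native resolution in one pass, so the cheap glance stage bounds the cost of the expensive zoom-in stage to a handful of candidate regions.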
Cite
Text
Ma et al. "When Visual Grounding Meets Gigapixel-Level Large-Scale Scenes: Benchmark and Approach." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.02088
Markdown
[Ma et al. "When Visual Grounding Meets Gigapixel-Level Large-Scale Scenes: Benchmark and Approach." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/ma2024cvpr-visual/) doi:10.1109/CVPR52733.2024.02088
BibTeX
@inproceedings{ma2024cvpr-visual,
title = {{When Visual Grounding Meets Gigapixel-Level Large-Scale Scenes: Benchmark and Approach}},
author = {Ma, Tao and Bai, Bing and Lin, Haozhe and Wang, Heyuan and Wang, Yu and Luo, Lin and Fang, Lu},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2024},
pages = {22119--22128},
doi = {10.1109/CVPR52733.2024.02088},
url = {https://mlanthology.org/cvpr/2024/ma2024cvpr-visual/}
}