When Visual Grounding Meets Gigapixel-Level Large-Scale Scenes: Benchmark and Approach

Tao Ma, Bing Bai, Haozhe Lin, Heyuan Wang, Yu Wang, Lin Luo, Lu Fang

CVPR 2024, pp. 22119-22128

doi:10.1109/CVPR52733.2024.02088

Abstract

Visual grounding refers to the process of associating natural language expressions with corresponding regions within an image. Existing benchmarks for visual grounding primarily operate within small-scale scenes with a few objects. Nevertheless, recent advances in imaging technology have enabled the acquisition of gigapixel-level images, providing high-resolution details in large-scale scenes containing numerous objects. To bridge this gap between imaging and computer vision benchmarks and make grounding more practically valuable, we introduce a novel dataset, named GigaGrounding, designed to challenge visual grounding models in gigapixel-level large-scale scenes. We extensively analyze and compare the dataset with existing benchmarks, demonstrating that GigaGrounding presents unique challenges such as large-scale scene understanding, gigapixel-level resolution, significant variations in object scales, and "multi-hop expressions". Furthermore, we introduce a simple yet effective grounding approach, which employs a "glance-to-zoom-in" paradigm and exhibits enhanced capabilities for addressing the GigaGrounding task. The dataset is available at www.gigavision.ai.

Cite

Text

Ma et al. "When Visual Grounding Meets Gigapixel-Level Large-Scale Scenes: Benchmark and Approach." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.02088

Markdown

[Ma et al. "When Visual Grounding Meets Gigapixel-Level Large-Scale Scenes: Benchmark and Approach." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/ma2024cvpr-visual/) doi:10.1109/CVPR52733.2024.02088

BibTeX

@inproceedings{ma2024cvpr-visual,
  title     = {{When Visual Grounding Meets Gigapixel-Level Large-Scale Scenes: Benchmark and Approach}},
  author    = {Ma, Tao and Bai, Bing and Lin, Haozhe and Wang, Heyuan and Wang, Yu and Luo, Lin and Fang, Lu},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {22119--22128},
  doi       = {10.1109/CVPR52733.2024.02088},
  url       = {https://mlanthology.org/cvpr/2024/ma2024cvpr-visual/}
}