Good at Captioning, Bad at Counting: Benchmarking GPT-4V on Earth Observation Data

Zhang, Chenhui; Wang, Sherrie

doi:10.1109/CVPRW63382.2024.00780

CVPRW 2024, pp. 7839-7849

Abstract

Large Vision-Language Models (VLMs) have demonstrated impressive performance on complex tasks involving visual input with natural language instructions. However, it remains unclear to what extent capabilities on natural images transfer to Earth observation (EO) data, which are predominantly satellite and aerial images less common in VLM training data. In this work, we propose a comprehensive benchmark to gauge the progress of VLMs toward being useful tools for EO data by assessing their abilities on scene understanding, localization and counting, and change detection. Motivated by real-world applications, our benchmark includes scenarios like urban monitoring, disaster relief, land use, and conservation. We discover that, although state-of-the-art VLMs like GPT-4V possess world knowledge that leads to strong performance on location understanding and image captioning, their poor spatial reasoning limits usefulness on object localization and counting. Our benchmark is publicly available online, and a full version of this paper is available separately.

Cite

Text

Zhang and Wang. "Good at Captioning, Bad at Counting: Benchmarking GPT-4V on Earth Observation Data." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024. doi:10.1109/CVPRW63382.2024.00780

Markdown

[Zhang and Wang. "Good at Captioning, Bad at Counting: Benchmarking GPT-4V on Earth Observation Data." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024.](https://mlanthology.org/cvprw/2024/zhang2024cvprw-good/) doi:10.1109/CVPRW63382.2024.00780

BibTeX

@inproceedings{zhang2024cvprw-good,
  title     = {{Good at Captioning, Bad at Counting: Benchmarking GPT-4V on Earth Observation Data}},
  author    = {Zhang, Chenhui and Wang, Sherrie},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2024},
  pages     = {7839--7849},
  doi       = {10.1109/CVPRW63382.2024.00780},
  url       = {https://mlanthology.org/cvprw/2024/zhang2024cvprw-good/}
}