Recognition Through Reasoning: Reinforcing Image Geo-Localization with Large Vision-Language Models

Abstract

Previous methods for image geo-localization have typically treated the task as either classification or retrieval, often relying on black-box decisions that lack interpretability. The rise of large vision-language models (LVLMs) has enabled a rethinking of geo-localization as a reasoning-driven task grounded in visual cues. However, two major challenges persist. On the data side, existing reasoning-focused datasets are primarily based on street-view imagery, offering limited scene diversity and constrained viewpoints. On the modeling side, current approaches predominantly rely on supervised fine-tuning, which yields only marginal improvements in reasoning capabilities. To address these challenges, we propose a novel pipeline that constructs a reasoning-oriented geo-localization dataset, $\textit{MP16-Reason}$, using diverse social media images. We introduce $\textit{GLOBE}$, $\textbf{G}$roup-relative policy optimization for $\textbf{L}$ocalizability assessment and $\textbf{O}$ptimized visual-cue reasoning, yielding $\textbf{B}$i-objective geo-$\textbf{E}$nhancement for the VLM in recognition and reasoning. $\textit{GLOBE}$ incorporates task-specific rewards that jointly enhance localizability assessment, visual-cue reasoning, and geolocation accuracy. Both qualitative and quantitative results demonstrate that $\textit{GLOBE}$ outperforms state-of-the-art open-source LVLMs on geo-localization tasks, particularly in diverse visual scenes, while also generating more insightful and interpretable reasoning trajectories. The data and code are available at https://github.com/lingli1996/GLOBE.
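As a rough illustration of the group-relative idea mentioned in the abstract, the sketch below combines three per-sample reward terms and normalizes them within a group of sampled responses, which is the standard advantage computation in group-relative policy optimization. It is not the authors' implementation. The reward definitions, the weights, the `scale_km` decay constant, and all function and field names are hypothetical placeholders chosen for illustration; the actual reward design is described in the paper and repository.

```python
import math
from statistics import mean, pstdev


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(min(1.0, a)))


def combined_reward(sample, truth, weights=(1.0, 1.0, 1.0), scale_km=750.0):
    """Hypothetical scalar reward mixing three terms (all assumptions, not the paper's).

    sample: dict with 'localizability' and 'cue_score' in [0, 1], plus 'lat' and 'lon'.
    truth:  dict with ground-truth 'lat' and 'lon'.
    """
    dist = haversine_km(sample["lat"], sample["lon"], truth["lat"], truth["lon"])
    geo_term = math.exp(-dist / scale_km)  # assumed decay: 1.0 at zero error, toward 0 far away
    w_loc, w_cue, w_geo = weights
    return w_loc * sample["localizability"] + w_cue * sample["cue_score"] + w_geo * geo_term


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r_i - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    truth = {"lat": 48.8584, "lon": 2.2945}
    # Three hypothetical responses sampled for the same image.
    group = [
        {"localizability": 0.9, "cue_score": 0.8, "lat": 48.86, "lon": 2.29},
        {"localizability": 0.9, "cue_score": 0.4, "lat": 51.50, "lon": -0.12},
        {"localizability": 0.5, "cue_score": 0.2, "lat": 40.71, "lon": -74.00},
    ]
    rewards = [combined_reward(s, truth) for s in group]
    print(group_relative_advantages(rewards))
```

Because advantages are computed relative to the other responses in the same group, a response is reinforced only when it scores better than its siblings for the same image, so no separate learned value model is needed for the baseline.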

Cite

Text

Li et al. "Recognition Through Reasoning: Reinforcing Image Geo-Localization with Large Vision-Language Models." Advances in Neural Information Processing Systems, 2025.

Markdown

[Li et al. "Recognition Through Reasoning: Reinforcing Image Geo-Localization with Large Vision-Language Models." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/li2025neurips-recognition/)

BibTeX

@inproceedings{li2025neurips-recognition,
  title     = {{Recognition Through Reasoning: Reinforcing Image Geo-Localization with Large Vision-Language Models}},
  author    = {Li, Ling and Zhou, Yao and Liang, Yuxuan and Tsung, Fugee and Wei, Jiaheng},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/li2025neurips-recognition/}
}