GeoProg3D: Compositional Visual Reasoning for City-Scale 3D Language Fields

Abstract

The advancement of 3D language fields has enabled intuitive interactions with 3D scenes via natural language. However, existing approaches are typically limited to small-scale environments, lacking the scalability and compositional reasoning capabilities necessary for large, complex urban settings. To overcome these limitations, we propose GeoProg3D, a visual programming framework that enables natural language-driven interactions with city-scale high-fidelity 3D scenes. GeoProg3D consists of two key components: (i) a Geography-aware City-scale 3D Language Field (GCLF) that leverages a memory-efficient hierarchical 3D model to handle large-scale data, integrated with geographic information for efficiently filtering vast urban spaces using directional cues, distance measurements, elevation data, and landmark references; and (ii) Geographical Vision APIs (GV-APIs), specialized geographic vision tools such as area segmentation and object detection. Our framework employs large language models (LLMs) as reasoning engines to dynamically combine GV-APIs and operate GCLF, effectively supporting diverse geographic vision tasks. To assess performance in city-scale reasoning, we introduce GeoEval3D, a comprehensive benchmark dataset containing 952 query-answer pairs across five challenging tasks: grounding, spatial reasoning, comparison, counting, and measurement. Experiments demonstrate that GeoProg3D significantly outperforms existing 3D language fields and vision-language models across multiple tasks. To our knowledge, GeoProg3D is the first framework enabling compositional geographic reasoning in high-fidelity city-scale 3D environments via natural language.
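To make the idea of composing geographic tools concrete, the following is a minimal sketch of what an LLM-generated program for a counting query might look like. It is an illustration only, not the authors' implementation: the abstract does not specify the GV-API signatures or the GCLF interface, so every name here (`SceneObject`, `find_landmark`, `within_distance`, `north_of`, `taller_than`, `program`) and the toy scene data are hypothetical, and a plain Python list stands in for the 3D language field.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneObject:
    """Hypothetical stand-in for an object retrieved from a 3D language field."""
    name: str
    category: str
    east_m: float    # metres east of an arbitrary scene origin
    north_m: float   # metres north of the same origin
    height_m: float  # height of the object in metres


def find_landmark(objs, name):
    """Landmark reference: look up an anchor object by name."""
    return next(o for o in objs if o.name == name)


def within_distance(objs, anchor, radius_m):
    """Distance filter: keep objects within radius_m (horizontal) of the anchor."""
    return [
        o for o in objs
        if math.hypot(o.east_m - anchor.east_m, o.north_m - anchor.north_m) <= radius_m
    ]


def north_of(objs, anchor):
    """Directional filter: keep objects lying north of the anchor."""
    return [o for o in objs if o.north_m > anchor.north_m]


def taller_than(objs, min_height_m):
    """Elevation-style filter: keep objects taller than min_height_m."""
    return [o for o in objs if o.height_m > min_height_m]


# Toy scene (invented data, not taken from GeoEval3D).
SCENE = [
    SceneObject("Central Station", "landmark", 0.0, 0.0, 30.0),
    SceneObject("Tower A", "building", 50.0, 200.0, 120.0),
    SceneObject("Tower B", "building", -100.0, 250.0, 80.0),
    SceneObject("Low Hall", "building", 20.0, 100.0, 15.0),
    SceneObject("Tower C", "building", 0.0, -150.0, 90.0),
    SceneObject("Tower D", "building", 400.0, 300.0, 150.0),
]


def program(scene):
    """Query: 'How many buildings taller than 50 m lie within 300 m north
    of Central Station?' expressed as a chain of filter calls."""
    anchor = find_landmark(scene, "Central Station")
    candidates = [o for o in scene if o.category == "building"]
    candidates = within_distance(candidates, anchor, 300.0)
    candidates = north_of(candidates, anchor)
    candidates = taller_than(candidates, 50.0)
    return sorted(o.name for o in candidates)


result = program(SCENE)
print(len(result), result)
```

The point of the sketch is the structure: each filter narrows the candidate set using one kind of geographic cue (landmark, distance, direction, height), and the final answer comes from composing them rather than from a single monolithic model call.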

Cite

Text

Yasuki et al. "GeoProg3D: Compositional Visual Reasoning for City-Scale 3D Language Fields." International Conference on Computer Vision, 2025.

Markdown

[Yasuki et al. "GeoProg3D: Compositional Visual Reasoning for City-Scale 3D Language Fields." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/yasuki2025iccv-geoprog3d/)

BibTeX

@inproceedings{yasuki2025iccv-geoprog3d,
  title     = {{GeoProg3D: Compositional Visual Reasoning for City-Scale 3D Language Fields}},
  author    = {Yasuki, Shunsuke and Miyanishi, Taiki and Inoue, Nakamasa and Kurita, Shuhei and Sakamoto, Koya and Azuma, Daichi and Taki, Masato and Matsuo, Yutaka},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {9737--9748},
  url       = {https://mlanthology.org/iccv/2025/yasuki2025iccv-geoprog3d/}
}