OpenCity3D: What Do Vision-Language Models Know About Urban Environments?

Abstract

The rise of 2D vision-language models (VLMs) has enabled new possibilities for language-driven 3D scene understanding tasks. Existing works focus on indoor scenes or autonomous driving scenarios and typically validate against a pre-defined set of semantic object classes. In this work, we analyze the capabilities of vision-language models for large-scale urban 3D scene understanding and propose new applications of VLMs that directly operate on aerial 3D reconstructions of cities. In particular, we address higher-level 3D scene understanding tasks such as population density, building age, property prices, crime rate, and noise pollution. Our analysis reveals surprising zero-shot and few-shot performance of VLMs in urban environments.
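To make the setting concrete, the sketch below illustrates one common recipe for zero-shot, language-driven 3D querying: compare language-aligned per-point features against text prompt embeddings and take the best-matching prompt per point. This is a minimal illustration of the general technique under stated assumptions, not necessarily this paper's exact pipeline; the function name, the feature arrays, and the embedding dimension are hypothetical stand-ins for outputs of a VLM such as CLIP.

import numpy as np

def zero_shot_scores(point_features: np.ndarray, prompt_features: np.ndarray) -> np.ndarray:
    """Cosine similarity between N per-point features and K prompt embeddings.

    point_features:  (N, D) language-aligned features lifted onto the 3D points.
    prompt_features: (K, D) text embeddings, e.g. for "an old building" vs.
                     "a newly constructed building".
    Returns an (N, K) score matrix; argmax over K assigns a label per point.
    """
    p = point_features / np.linalg.norm(point_features, axis=1, keepdims=True)
    t = prompt_features / np.linalg.norm(prompt_features, axis=1, keepdims=True)
    return p @ t.T

# Toy usage with random stand-in embeddings (D = 512, as in CLIP ViT-B/32).
rng = np.random.default_rng(0)
points = rng.normal(size=(1000, 512))   # hypothetical per-point VLM features
prompts = rng.normal(size=(2, 512))     # hypothetical text prompt embeddings
labels = zero_shot_scores(points, prompts).argmax(axis=1)
print(labels[:10])

In practice, few-shot variants replace the argmax with a lightweight regressor fitted on the similarity scores against a handful of labeled examples.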

Cite

Text

Bieri et al. "OpenCity3D: What Do Vision-Language Models Know About Urban Environments?" Winter Conference on Applications of Computer Vision, 2025.

Markdown

[Bieri et al. "OpenCity3D: What Do Vision-Language Models Know About Urban Environments?" Winter Conference on Applications of Computer Vision, 2025.](https://mlanthology.org/wacv/2025/bieri2025wacv-opencity3d/)

BibTeX

@inproceedings{bieri2025wacv-opencity3d,
  title     = {{OpenCity3D: What Do Vision-Language Models Know About Urban Environments?}},
  author    = {Bieri, Valentin and Zamboni, Marco and Blumer, Nicolas Samuel and Chen, Qingxuan and Engelmann, Francis},
  booktitle = {Winter Conference on Applications of Computer Vision},
  year      = {2025},
  pages     = {5147--5155},
  url       = {https://mlanthology.org/wacv/2025/bieri2025wacv-opencity3d/}
}