CitySeeker: How Do VLMs Explore Embodied Urban Navigation with Implicit Human Needs?

Abstract

Vision-Language Models (VLMs) have made significant progress in explicit instruction-based navigation; however, their ability to interpret implicit human needs (e.g., "I am thirsty") in dynamic urban environments remains underexplored. This paper introduces CitySeeker, a novel benchmark designed to assess VLMs' spatial reasoning and decision-making capabilities in embodied urban navigation that addresses implicit needs. CitySeeker includes 6,440 trajectories across 8 cities, capturing diverse visual characteristics and implicit needs in 7 goal-driven scenarios. Extensive experiments reveal that even top-performing models (e.g., Qwen2.5-VL-32B-Instruct) achieve only 21.1% task completion. We identify three key bottlenecks: error accumulation in long-horizon reasoning, inadequate spatial cognition, and deficient experiential recall. To analyze these bottlenecks further, we investigate a series of exploratory strategies: Backtracking Mechanisms, Enriching Spatial Cognition, and Memory-Based Retrieval (BCR), which are inspired by human cognitive mapping's emphasis on iterative observation-reasoning cycles and adaptive path optimization. Our analysis provides actionable insights for developing VLMs with the robust spatial intelligence required to tackle "last-mile" navigation challenges.

Cite

Text

Wang et al. "CitySeeker: How Do VLMs Explore Embodied Urban Navigation with Implicit Human Needs?" International Conference on Learning Representations, 2026.

Markdown

[Wang et al. "CitySeeker: How Do VLMs Explore Embodied Urban Navigation with Implicit Human Needs?" International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/wang2026iclr-cityseeker/)

BibTeX

@inproceedings{wang2026iclr-cityseeker,
  title     = {{CitySeeker: How Do VLMs Explore Embodied Urban Navigation with Implicit Human Needs?}},
  author    = {Wang, Siqi and Liang, Chao and Gao, Yunfan and Yu, Erxin and Li, Sen and Li, Jing and Wang, Haofen},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/wang2026iclr-cityseeker/}
}