CitySeeker: How Do VLMs Explore Embodied Urban Navigation with Implicit Human Needs?

Wang, Siqi; Liang, Chao; Gao, Yunfan; Yu, Erxin; Li, Sen; Li, Jing; Wang, Haofen

ICLR 2026

Abstract

Vision-Language Models (VLMs) have made significant progress in explicit instruction-based navigation; however, their ability to interpret implicit human needs (e.g., "I am thirsty") in dynamic urban environments remains underexplored. This paper introduces CitySeeker, a novel benchmark designed to assess VLMs’ spatial reasoning and decision-making capabilities in embodied urban navigation that addresses implicit needs. CitySeeker includes 6,440 trajectories across 8 cities, capturing diverse visual characteristics and implicit needs in 7 goal-driven scenarios. Extensive experiments reveal that even top-performing models (e.g., Qwen2.5-VL-32B-Instruct) achieve only 21.1% task completion. We identify three key bottlenecks: error accumulation in long-horizon reasoning, inadequate spatial cognition, and deficient experiential recall. To analyze these bottlenecks further, we investigate a series of exploratory strategies (Backtracking Mechanisms, Enriching Spatial Cognition, and Memory-Based Retrieval, collectively BCR), inspired by human cognitive mapping's emphasis on iterative observation-reasoning cycles and adaptive path optimization. Our analysis provides actionable insights for developing VLMs with the robust spatial intelligence required for tackling "last-mile" navigation challenges.

Cite

Text

Wang et al. "CitySeeker: How Do VLMs Explore Embodied Urban Navigation with Implicit Human Needs?" International Conference on Learning Representations, 2026.

Markdown

[Wang et al. "CitySeeker: How Do VLMs Explore Embodied Urban Navigation with Implicit Human Needs?" International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/wang2026iclr-cityseeker/)

BibTeX

@inproceedings{wang2026iclr-cityseeker,
  title     = {{CitySeeker: How Do VLMs Explore Embodied Urban Navigation with Implicit Human Needs?}},
  author    = {Wang, Siqi and Liang, Chao and Gao, Yunfan and Yu, Erxin and Li, Sen and Li, Jing and Wang, Haofen},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/wang2026iclr-cityseeker/}
}