Teaching Visual Language Models to Navigate Using Maps

Abstract

Visual Language Models (VLMs) have shown impressive abilities in understanding and generating multimodal content by integrating visual and textual information. Recently, language-guided aerial navigation benchmarks have emerged, presenting a novel challenge for VLMs. In this work, we focus on the utilization of navigation maps, a critical component of the broader aerial navigation problem. We analyze the CityNav benchmark, a recently introduced dataset for language-goal aerial navigation that incorporates navigation maps and 3D point clouds of real cities to simulate environments for drones. We demonstrate that existing open-source VLMs perform poorly in understanding navigation maps in a zero-shot setting. To address this, we fine-tune one of the top-performing VLMs, Qwen2-VL, on map data, achieving near-perfect performance on a landmark-based navigation task. Notably, our fine-tuned Qwen2-VL model, using only the landmark map, achieves performance on par with the best baseline model in the CityNav benchmark. This highlights the potential of leveraging navigation maps for enhancing VLM capabilities in aerial navigation tasks.
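
The zero-shot map-understanding setup mentioned in the abstract can be approximated with the standard Hugging Face transformers interface for Qwen2-VL. The sketch below is illustrative only, not the paper's actual evaluation pipeline; the checkpoint choice (Qwen/Qwen2-VL-7B-Instruct), the map image file, and the landmark question are assumptions made for the example.

# Illustrative zero-shot map-question prompt for Qwen2-VL via Hugging Face transformers.
# The checkpoint, image path, and question are example choices, not taken from the paper.
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "Qwen/Qwen2-VL-7B-Instruct"  # one of the open Qwen2-VL checkpoints
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

map_image = Image.open("landmark_map.png")  # hypothetical rendered navigation map
question = "Which landmark on this map is closest to the marked drone position?"

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": question},
    ]}
]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[map_image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=64)
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)

A fine-tuned variant would be queried the same way after swapping model_id for the adapted checkpoint; only the weights change, not the prompting interface.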

Cite

Text

Galstyan et al. "Teaching Visual Language Models to Navigate Using Maps." ICLR 2025 Workshops: WRL, 2025.

Markdown

[Galstyan et al. "Teaching Visual Language Models to Navigate Using Maps." ICLR 2025 Workshops: WRL, 2025.](https://mlanthology.org/iclrw/2025/galstyan2025iclrw-teaching/)

BibTeX

@inproceedings{galstyan2025iclrw-teaching,
  title     = {{Teaching Visual Language Models to Navigate Using Maps}},
  author    = {Galstyan, Tigran and Tamazyan, Hakob and Nurijanyan, Narek},
  booktitle = {ICLR 2025 Workshops: WRL},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/galstyan2025iclrw-teaching/}
}