ImagineNav: Prompting Vision-Language Models as Embodied Navigator Through Scene Imagination

Abstract

Visual navigation is an essential skill for home-assistance robots, providing the object-searching ability needed to accomplish long-horizon daily tasks. Many recent approaches use Large Language Models (LLMs) for commonsense inference to improve exploration efficiency. However, the planning process of LLMs is confined to text, which struggles to represent spatial occupancy and geometric layout, both of which are important for making rational navigation decisions. In this work, we seek to unleash the spatial perception and planning ability of Vision-Language Models (VLMs), and explore whether a VLM, given only RGB/RGB-D streams from an on-board camera, can efficiently accomplish visual navigation tasks in a mapless manner. We achieve this by developing ImagineNav, an imagination-powered navigation framework that imagines future observation images at valuable robot views and translates the complex navigation planning process into a much simpler best-view image selection problem for the VLM. To generate appropriate candidate robot views for imagination, we introduce the Where2Imagine module, which is distilled to align with human navigation habits. Finally, an off-the-shelf point-goal navigation policy is used to reach the views the VLM prefers. Empirical experiments on challenging open-vocabulary object navigation benchmarks demonstrate the superiority of our proposed system.
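The abstract outlines a three-stage pipeline: propose candidate views, imagine their observations, and let the VLM pick the best one. The following Python sketch illustrates one planning step under that reading. It is not the authors' implementation; apart from the module name Where2Imagine, every identifier here (`Pose`, `synthesizer.render`, `vlm.select_best_view`, `policy.act`) is a hypothetical placeholder for an interface the paper does not specify.

```python
# A minimal, hypothetical sketch of one ImagineNav planning step, assuming
# placeholder interfaces for view proposal, novel-view synthesis, VLM
# selection, and point-goal control.

from dataclasses import dataclass


@dataclass
class Pose:
    x: float
    y: float
    yaw: float  # heading in radians, relative to the current robot frame


def imagine_nav_step(rgb_obs, goal_object, where2imagine, synthesizer, vlm, policy):
    """One mapless planning step: propose views, imagine them, select, navigate."""
    # 1. Where2Imagine proposes candidate relative poses, distilled to
    #    mimic human navigation habits (hypothetical .propose interface).
    candidate_poses = where2imagine.propose(rgb_obs)  # -> list[Pose]

    # 2. Imagine the future observation image at each candidate pose from
    #    the current RGB/RGB-D observation (hypothetical .render interface).
    imagined_images = [synthesizer.render(rgb_obs, pose) for pose in candidate_poses]

    # 3. Planning reduces to best-view image selection: prompt the VLM with
    #    the goal object and the imagined images, get back the index of the
    #    most promising view (hypothetical .select_best_view interface).
    best_idx = vlm.select_best_view(goal_object, imagined_images)

    # 4. An off-the-shelf point-goal policy drives the robot toward the
    #    selected view's pose, so no global map is ever built.
    return policy.act(rgb_obs, target=candidate_poses[best_idx])
```

Repeating this step at each decision point yields the full object-search behavior: the VLM never reasons over a map, only over candidate future images.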

Cite

Text

Zhao et al. "ImagineNav: Prompting Vision-Language Models as Embodied Navigator Through Scene Imagination." International Conference on Learning Representations, 2025.

Markdown

[Zhao et al. "ImagineNav: Prompting Vision-Language Models as Embodied Navigator Through Scene Imagination." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/zhao2025iclr-imaginenav/)

BibTeX

@inproceedings{zhao2025iclr-imaginenav,
  title     = {{ImagineNav: Prompting Vision-Language Models as Embodied Navigator Through Scene Imagination}},
  author    = {Zhao, Xinxin and Cai, Wenzhe and Tang, Likun and Wang, Teng},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/zhao2025iclr-imaginenav/}
}