NavBench: Probing Multimodal Large Language Models for Embodied Navigation

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated strong generalization in vision-language tasks, yet their ability to understand and act within embodied environments remains underexplored. We present NavBench, a benchmark for evaluating the embodied navigation capabilities of MLLMs in zero-shot settings. NavBench consists of two components: (1) navigation comprehension, assessed through three cognitively grounded tasks (global instruction alignment, temporal progress estimation, and local observation-action reasoning) that together cover 3,200 question-answer pairs; and (2) step-by-step execution in 432 episodes across 72 indoor scenes, stratified by spatial, cognitive, and execution complexity. To support real-world deployment, we introduce a pipeline that converts MLLM outputs into robotic actions. We evaluate both proprietary and open-source models and find that GPT-4o performs well across tasks, while lighter open-source models succeed in simpler cases. Results also show that models with higher comprehension scores tend to achieve better execution performance. Providing map-based context improves decision accuracy, especially in medium-difficulty scenarios. However, most models struggle with temporal understanding, particularly with estimating progress during navigation, which may pose a key challenge.

Cite

Text

Qiao et al. "NavBench: Probing Multimodal Large Language Models for Embodied Navigation." Advances in Neural Information Processing Systems, 2025.

Markdown

[Qiao et al. "NavBench: Probing Multimodal Large Language Models for Embodied Navigation." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/qiao2025neurips-navbench/)

BibTeX

@inproceedings{qiao2025neurips-navbench,
  title     = {{NavBench: Probing Multimodal Large Language Models for Embodied Navigation}},
  author    = {Qiao, Yanyuan and Hong, Haodong and Lyu, Wenqi and An, Dong and Zhang, Siqi and Xie, Yutong and Wang, Xinyu and Wu, Qi},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/qiao2025neurips-navbench/}
}