What Can VLMs Do for Zero-Shot Embodied Task Planning?

Abstract

Recent advances in Vision Language Models (VLMs) for robotics demonstrate their enormous potential. However, the performance limitations of VLMs in embodied task planning, which requires high precision and reliability, remain unclear, greatly constraining their application in this field. To this end, this paper provides an in-depth and comprehensive evaluation of VLM performance in zero-shot embodied task planning. We first develop a systematic evaluation framework covering the key capability dimensions essential for task planning, the first of its kind, aimed at identifying the factors that prevent VLMs from producing accurate task plans. Based on this framework, we propose a benchmark dataset, ETP-Bench, to evaluate VLM performance on embodied task planning. Extensive experiments indicate that the current state-of-the-art VLM, GPT-4V, achieves only 19% task-planning accuracy on our benchmark, with the low accuracy attributable mainly to deficiencies in spatial perception and object type recognition. We hope this study provides data support for future robotics research and inspires more targeted research directions.

Cite

Text

Fu et al. "What Can VLMs Do for Zero-Shot Embodied Task Planning?" ICML 2024 Workshops: LLMs_and_Cognition, 2024.

Markdown

[Fu et al. "What Can VLMs Do for Zero-Shot Embodied Task Planning?" ICML 2024 Workshops: LLMs_and_Cognition, 2024.](https://mlanthology.org/icmlw/2024/fu2024icmlw-vlms/)

BibTeX

@inproceedings{fu2024icmlw-vlms,
  title     = {{What Can VLMs Do for Zero-Shot Embodied Task Planning?}},
  author    = {Fu, Xian and Zhang, Min and Hao, Jianye and Han, Peilong and Zhang, Hao and Shi, Lei and Tang, Hongyao},
  booktitle = {ICML 2024 Workshops: LLMs_and_Cognition},
  year      = {2024},
  url       = {https://mlanthology.org/icmlw/2024/fu2024icmlw-vlms/}
}