DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

Abstract

A primary hurdle of autonomous driving in urban environments is understanding complex and long-tail scenarios, such as challenging road conditions and delicate human behaviors. We introduce DriveVLM, an autonomous driving system that leverages Vision-Language Models (VLMs) for enhanced scene understanding and planning. DriveVLM integrates a unique combination of reasoning modules for scene description, scene analysis, and hierarchical planning. Furthermore, recognizing VLMs' limitations in spatial reasoning and their heavy computational requirements, we propose DriveVLM-Dual, a hybrid system that combines the strengths of DriveVLM with a traditional autonomous driving pipeline. Experiments on both the nuScenes dataset and our SUP-AD dataset demonstrate the efficacy of DriveVLM and DriveVLM-Dual in handling complex and unpredictable driving conditions. Finally, we deploy DriveVLM-Dual on a production vehicle, verifying its effectiveness in real-world autonomous driving environments.
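To make the described architecture concrete, below is a minimal Python sketch of the pipeline as the abstract outlines it: a VLM queried in three chained stages (scene description, scene analysis, hierarchical planning), with the Dual variant handing the VLM's coarse trajectory to a conventional real-time planner for refinement. All names here (`drivevlm_plan`, `drivevlm_dual_plan`, `CoarsePlan`, and the `vlm`, `parse`, and `refine` callables) are hypothetical stand-ins for illustration, not the authors' published API.

```python
# Illustrative sketch of a DriveVLM-style pipeline (hypothetical names).
# Stage 1-3: scene description -> scene analysis -> hierarchical planning,
# each stage conditioning the VLM on the previous stage's output.
# "Dual" mode: the slow, low-frequency VLM proposal seeds a fast classical
# planner, compensating for the VLM's latency and weaker spatial reasoning.

from dataclasses import dataclass
from typing import Callable, List, Tuple

Waypoint = Tuple[float, float]            # (x, y) in the ego vehicle frame
VLMQuery = Callable[[str, str], str]      # (prompt, context) -> text answer


@dataclass
class CoarsePlan:
    meta_actions: List[str]               # e.g. ["decelerate", "yield to pedestrian"]
    waypoints: List[Waypoint]             # coarse trajectory proposed by the VLM


def drivevlm_plan(vlm: VLMQuery, parse: Callable[[str], CoarsePlan]) -> CoarsePlan:
    """Chain the three reasoning stages named in the abstract."""
    description = vlm("Describe the driving scene (weather, road, critical objects).", "")
    analysis = vlm("Analyze how the critical objects may affect the ego vehicle.", description)
    plan_text = vlm("Propose meta-actions and a coarse trajectory.", analysis)
    return parse(plan_text)


def drivevlm_dual_plan(
    vlm: VLMQuery,
    parse: Callable[[str], CoarsePlan],
    refine: Callable[[List[Waypoint]], List[Waypoint]],
) -> List[Waypoint]:
    """Hybrid ('Dual') mode: refine the VLM's coarse trajectory with a
    high-frequency traditional planning module."""
    coarse = drivevlm_plan(vlm, parse)
    return refine(coarse.waypoints)
```

In practice, `refine` could wrap an existing optimization- or rule-based planner that tracks the VLM-proposed waypoints at high frequency, which is the kind of division of labor the hybrid system is meant to exploit.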

Cite

Text

Tian et al. "DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models." Proceedings of The 8th Conference on Robot Learning, 2024.

Markdown

[Tian et al. "DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models." Proceedings of The 8th Conference on Robot Learning, 2024.](https://mlanthology.org/corl/2024/tian2024corl-drivevlm/)

BibTeX

@inproceedings{tian2024corl-drivevlm,
  title     = {{DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models}},
  author    = {Tian, Xiaoyu and Gu, Junru and Li, Bailin and Liu, Yicheng and Wang, Yang and Zhao, Zhiyong and Zhan, Kun and Jia, Peng and Lang, XianPeng and Zhao, Hang},
  booktitle = {Proceedings of The 8th Conference on Robot Learning},
  year      = {2024},
  pages     = {4698--4726},
  volume    = {270},
  url       = {https://mlanthology.org/corl/2024/tian2024corl-drivevlm/}
}