Holistic Autonomous Driving Understanding by Bird's-Eye-View Injected Multi-Modal Large Models

Abstract

The rise of multimodal large language models (MLLMs) has spurred interest in language-based driving tasks. However, existing research typically focuses on limited tasks and often omits key multi-view and temporal information, which is crucial for robust autonomous driving. To bridge these gaps, we introduce NuInstruct, a novel dataset with 91K multi-view video-QA pairs across 17 subtasks, where each task demands holistic information (e.g., temporal, multi-view, and spatial), significantly elevating the challenge level. To obtain NuInstruct, we propose a novel SQL-based method to generate instruction-response pairs automatically, inspired by the logical progression of human driving. We further present BEV-InMLLM, an end-to-end method for efficiently deriving instruction-aware Bird's-Eye-View (BEV) features that are language-aligned for large language models. BEV-InMLLM integrates multi-view, spatial awareness, and temporal semantics to enhance MLLMs' capabilities on NuInstruct tasks. Moreover, our proposed BEV injection module is a plug-and-play method for existing MLLMs. Our experiments on NuInstruct demonstrate that BEV-InMLLM significantly outperforms existing MLLMs, e.g., a 9% improvement on various tasks. We release NuInstruct at https://github.com/xmed-lab/NuInstruct.

Cite

Text

Ding et al. "Holistic Autonomous Driving Understanding by Bird's-Eye-View Injected Multi-Modal Large Models." Conference on Computer Vision and Pattern Recognition, 2024.

Markdown

[Ding et al. "Holistic Autonomous Driving Understanding by Bird's-Eye-View Injected Multi-Modal Large Models." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/ding2024cvpr-holistic/)

BibTeX

@inproceedings{ding2024cvpr-holistic,
  title     = {{Holistic Autonomous Driving Understanding by Bird's-Eye-View Injected Multi-Modal Large Models}},
  author    = {Ding, Xinpeng and Han, Jianhua and Xu, Hang and Liang, Xiaodan and Zhang, Wei and Li, Xiaomeng},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {13668--13677},
  url       = {https://mlanthology.org/cvpr/2024/ding2024cvpr-holistic/}
}