MAPLM: A Real-World Large-Scale Vision-Language Benchmark for Map and Traffic Scene Understanding

Abstract

Vision-language generative AI has demonstrated remarkable promise for empowering cross-modal scene understanding of autonomous driving and high-definition (HD) map systems. However, current benchmark datasets lack multi-modal point cloud, image, and language data pairs. Recent approaches utilize visual instruction learning and cross-modal prompt engineering to expand vision-language models into this domain. In this paper, we propose a new vision-language benchmark that can be used to finetune traffic and HD map domain-specific foundation models. Specifically, we annotate and leverage large-scale, broad-coverage traffic and map data extracted from huge HD map annotations, and use CLIP and LLaMA-2 / Vicuna to finetune a baseline model with instruction-following data. Our experimental results across various algorithms reveal that while visual instruction-tuning large language models (LLMs) can effectively learn meaningful representations from MAPLM-QA, there remains significant room for further advancements. To facilitate applying LLMs and multi-modal data into self-driving research, we will release our visual-language QA data and the baseline models at github.com/LLVM-AD/MAPLM.

Cite

Text

Cao et al. "MAPLM: A Real-World Large-Scale Vision-Language Benchmark for Map and Traffic Scene Understanding." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.02061

Markdown

[Cao et al. "MAPLM: A Real-World Large-Scale Vision-Language Benchmark for Map and Traffic Scene Understanding." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/cao2024cvpr-maplm/) doi:10.1109/CVPR52733.2024.02061

BibTeX

@inproceedings{cao2024cvpr-maplm,
  title     = {{MAPLM: A Real-World Large-Scale Vision-Language Benchmark for Map and Traffic Scene Understanding}},
  author    = {Cao, Xu and Zhou, Tong and Ma, Yunsheng and Ye, Wenqian and Cui, Can and Tang, Kun and Cao, Zhipeng and Liang, Kaizhao and Wang, Ziran and Rehg, James M. and Zheng, Chao},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {21819-21830},
  doi       = {10.1109/CVPR52733.2024.02061},
  url       = {https://mlanthology.org/cvpr/2024/cao2024cvpr-maplm/}
}