V3LMA: Visual 3D-Enhanced Language Model for Autonomous Driving

Abstract

Large Vision-Language Models (LVLMs) have shown strong capabilities in understanding and analyzing visual scenes across various domains. However, in the context of autonomous driving, their limited comprehension of 3D environments restricts their effectiveness in achieving a complete and safe understanding of dynamic surroundings. To address this, we introduce V3LMA, a novel approach that enhances 3D scene understanding by integrating Large Language Models (LLMs) with LVLMs. V3LMA leverages textual descriptions generated from object detections and video inputs, significantly boosting performance without requiring fine-tuning. Through a dedicated preprocessing pipeline that extracts 3D object data, our method improves situational awareness and decision-making in complex traffic scenarios, achieving a score of 0.56 on the LingoQA benchmark. We further explore different fusion strategies and token combinations with the goal of advancing the interpretation of traffic scenes, ultimately enabling safer autonomous driving systems.
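To illustrate the general idea of feeding 3D detections to a language model as text, here is a minimal sketch. The `Detection3D` fields, distance thresholds, and output phrasing are assumptions for illustration only; the paper's actual preprocessing pipeline and detection format may differ.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Detection3D:
    # Hypothetical detection format; the paper's pipeline may use different fields.
    label: str   # object class, e.g. "car"
    x: float     # lateral offset in meters (negative = left of ego vehicle)
    z: float     # longitudinal distance in meters (positive = ahead)

def describe_scene(detections: List[Detection3D]) -> str:
    """Convert 3D detections into a textual scene description that can be
    prepended to an LVLM prompt alongside the video input."""
    lines = []
    # Sort by distance so the nearest (most safety-relevant) objects come first.
    for d in sorted(detections, key=lambda d: d.z):
        side = "left" if d.x < -1.0 else ("right" if d.x > 1.0 else "ahead")
        lines.append(f"- {d.label}, {d.z:.0f} m away, {side}")
    return "3D objects in the scene:\n" + "\n".join(lines)
```

A fusion strategy of this kind requires no fine-tuning: the textual description simply augments the prompt passed to the frozen LVLM.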

Cite

Text

Lübberstedt et al. "V3LMA: Visual 3D-Enhanced Language Model for Autonomous Driving." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2025.

Markdown

[Lübberstedt et al. "V3LMA: Visual 3D-Enhanced Language Model for Autonomous Driving." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2025.](https://mlanthology.org/cvprw/2025/lubberstedt2025cvprw-v3lma/)

BibTeX

@inproceedings{lubberstedt2025cvprw-v3lma,
  title     = {{V3LMA: Visual 3D-Enhanced Language Model for Autonomous Driving}},
  author    = {Lübberstedt, Jannik and Rivera, Esteban and Uhlemann, Nico and Lienkamp, Markus},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2025},
  pages     = {4769--4778},
  url       = {https://mlanthology.org/cvprw/2025/lubberstedt2025cvprw-v3lma/}
}