OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning

Abstract

Advances in vision-language models (VLMs) have led to growing interest in leveraging their strong reasoning capabilities for autonomous driving. However, extending these capabilities from 2D to full 3D understanding is crucial for real-world applications. To address this challenge, we propose OmniDrive, a holistic vision-language dataset that aligns agent models with 3D driving tasks through counterfactual reasoning. This approach enhances decision-making by evaluating potential scenarios and their outcomes, much as human drivers consider alternative actions. Our counterfactual-based synthetic data annotation process generates large-scale, high-quality datasets, providing denser supervision signals that bridge planning trajectories and language-based reasoning. Further, we explore two advanced OmniDrive-Agent frameworks, namely Omni-L and Omni-Q, to assess the importance of vision-language alignment versus 3D perception, revealing critical insights into designing effective LLM agents. Significant improvements on the DriveLM Q&A benchmark and nuScenes open-loop planning demonstrate the effectiveness of our dataset and methods.

Cite

Text

Wang et al. "OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.02090

Markdown

[Wang et al. "OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/wang2025cvpr-omnidrive/) doi:10.1109/CVPR52734.2025.02090

BibTeX

@inproceedings{wang2025cvpr-omnidrive,
  title     = {{OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning}},
  author    = {Wang, Shihao and Yu, Zhiding and Jiang, Xiaohui and Lan, Shiyi and Shi, Min and Chang, Nadine and Kautz, Jan and Li, Ying and Alvarez, Jose M.},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {22442--22452},
  doi       = {10.1109/CVPR52734.2025.02090},
  url       = {https://mlanthology.org/cvpr/2025/wang2025cvpr-omnidrive/}
}