LingoQA: Video Question Answering for Autonomous Driving

Abstract

We introduce LingoQA, a novel dataset and benchmark for visual question answering in autonomous driving. The dataset contains 28K unique short video scenarios and 419K annotations. Evaluating state-of-the-art vision-language models on our benchmark shows that their performance is below human capabilities, with GPT-4V responding truthfully to 59.6% of the questions compared to 96.6% for humans. For evaluation, we propose a truthfulness classifier, called Lingo-Judge, that achieves a 0.95 Spearman correlation coefficient to human evaluations, surpassing existing techniques like METEOR, BLEU, CIDEr, and GPT-4. We establish a baseline vision-language model and run extensive ablation studies to understand its performance. We release our dataset and benchmark as an evaluation platform for vision-language models in autonomous driving: https://github.com/wayveai/LingoQA
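
As a rough illustration of the comparison reported above, the sketch below computes the Spearman rank correlation between an automatic judge's per-question scores and human ratings, the statistic used to compare Lingo-Judge against METEOR, BLEU, CIDEr, and GPT-4. The example scores and variable names are hypothetical; the authors' actual evaluation code is available in the linked repository.

# Minimal sketch (not the authors' code): measuring how well an automatic
# judge agrees with human evaluation via Spearman rank correlation.
from scipy.stats import spearmanr

# Hypothetical per-question scores from an automatic judge
# (e.g. a classifier's probability that an answer is truthful) ...
judge_scores = [0.92, 0.15, 0.78, 0.40, 0.88, 0.05]

# ... and the corresponding human ratings for the same model answers
# (e.g. 1 = truthful, 0 = not truthful).
human_scores = [1, 0, 1, 0, 1, 0]

# Spearman's rho correlates the ranks of the two score lists; a value
# near 1.0 means the judge orders answers almost exactly as humans do.
rho, p_value = spearmanr(judge_scores, human_scores)
print(f"Spearman correlation: {rho:.2f} (p={p_value:.3f})")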

Cite

Text

Marcu et al. "LingoQA: Video Question Answering for Autonomous Driving." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-72980-5_15

Markdown

[Marcu et al. "LingoQA: Video Question Answering for Autonomous Driving." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/marcu2024eccv-lingoqa/) doi:10.1007/978-3-031-72980-5_15

BibTeX

@inproceedings{marcu2024eccv-lingoqa,
  title     = {{LingoQA: Video Question Answering for Autonomous Driving}},
  author    = {Marcu, Ana-Maria and Chen, Long and Hünermann, Jan and Karnsund, Alice and Hanotte, Benoit and Chidananda, Prajwal and Nair, Saurabh and Badrinarayanan, Vijay and Kendall, Alex and Shotton, Jamie and Arani, Elahe and Sinavski, Oleg},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-72980-5_15},
  url       = {https://mlanthology.org/eccv/2024/marcu2024eccv-lingoqa/}
}