nuScenes-QA: A Multi-Modal Visual Question Answering Benchmark for Autonomous Driving Scenario

Abstract

We introduce a novel visual question answering (VQA) task in the context of autonomous driving, which aims to answer natural language questions based on street-view clues. Compared to traditional VQA tasks, VQA in the autonomous driving scenario presents additional challenges. First, the raw visual data are multi-modal, including images and point clouds captured by camera and LiDAR, respectively. Second, the data are multi-frame due to continuous, real-time acquisition. Third, the outdoor scenes contain both moving foreground objects and a static background. Existing VQA benchmarks fail to adequately address these complexities. To bridge this gap, we propose NuScenes-QA, the first benchmark for VQA in the autonomous driving scenario, encompassing 34K visual scenes and 460K question-answer pairs. Specifically, we leverage existing 3D detection annotations to generate scene graphs and manually design question templates; the question-answer pairs are then generated programmatically from these templates. Comprehensive statistics show that NuScenes-QA is a balanced, large-scale benchmark with diverse question formats. Building on it, we develop a series of baselines that employ advanced 3D detection and VQA techniques. Our extensive experiments highlight the challenges posed by this new task. Code and dataset are available at https://github.com/qiantianwen/NuScenes-QA.
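The template-based generation pipeline described above can be sketched roughly as follows. This is a hypothetical illustration, not the paper's actual implementation: the category names, `status` attribute, and question templates are invented for the example, and the real pipeline operates on full nuScenes 3D detection annotations with a richer scene-graph structure.

```python
# Hypothetical sketch of NuScenes-QA-style programmatic QA generation:
# 3D detection annotations -> a minimal "scene graph" (per-category counts)
# -> question-answer pairs filled from hand-written templates.
from collections import Counter


def build_scene_graph(annotations):
    """Group per-object annotations by (category, status) as a minimal scene graph."""
    graph = Counter()
    for obj in annotations:
        graph[(obj["category"], obj["status"])] += 1
    return graph


def generate_qa_pairs(graph):
    """Instantiate count and existence templates with scene-graph facts."""
    qa_pairs = []
    for (category, status), count in graph.items():
        qa_pairs.append((f"How many {status} {category}s are there?", str(count)))
        qa_pairs.append((f"Are there any {status} {category}s?", "yes"))
    return qa_pairs


# Toy annotations standing in for 3D detection labels of one scene.
annotations = [
    {"category": "car", "status": "moving"},
    {"category": "car", "status": "moving"},
    {"category": "pedestrian", "status": "standing"},
]
qa = generate_qa_pairs(build_scene_graph(annotations))
```

In the real benchmark, negative existence answers, spatial relations between objects, and multi-frame status attributes would also enter the templates; this sketch only shows the template-filling mechanism.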

Cite

Text

Qian et al. "nuScenes-QA: A Multi-Modal Visual Question Answering Benchmark for Autonomous Driving Scenario." AAAI Conference on Artificial Intelligence, 2024. doi:10.1609/AAAI.V38I5.28253

Markdown

[Qian et al. "nuScenes-QA: A Multi-Modal Visual Question Answering Benchmark for Autonomous Driving Scenario." AAAI Conference on Artificial Intelligence, 2024.](https://mlanthology.org/aaai/2024/qian2024aaai-nuscenes/) doi:10.1609/AAAI.V38I5.28253

BibTeX

@inproceedings{qian2024aaai-nuscenes,
  title     = {{nuScenes-QA: A Multi-Modal Visual Question Answering Benchmark for Autonomous Driving Scenario}},
  author    = {Qian, Tianwen and Chen, Jingjing and Zhuo, Linhai and Jiao, Yang and Jiang, Yu-Gang},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2024},
  pages     = {4542--4550},
  doi       = {10.1609/AAAI.V38I5.28253},
  url       = {https://mlanthology.org/aaai/2024/qian2024aaai-nuscenes/}
}