TUMTraf VideoQA: Dataset and Benchmark for Unified Spatio-Temporal Video Understanding in Traffic Scenes

Abstract

We present TUMTraf VideoQA, a novel dataset and benchmark designed for spatio-temporal video understanding in complex roadside traffic scenarios. The dataset comprises 1,000 videos, featuring 85,000 multiple-choice QA pairs, 2,300 object captioning annotations, and 5,700 object grounding annotations, encompassing diverse real-world conditions such as adverse weather and traffic anomalies. By incorporating tuple-based spatio-temporal object expressions, TUMTraf VideoQA unifies three essential tasks—multiple-choice video question answering, referred object captioning, and spatio-temporal object grounding—within a cohesive evaluation framework. We further introduce the TraffiX-Qwen baseline model, enhanced with visual token sampling strategies, which provides valuable insights into the challenges of fine-grained spatio-temporal reasoning. Extensive experiments demonstrate the dataset's complexity, highlight the limitations of existing models, and position TUMTraf VideoQA as a robust foundation for advancing research in intelligent transportation systems. The dataset and benchmark are publicly available to facilitate further exploration.

Cite

Text

Zhou et al. "TUMTraf VideoQA: Dataset and Benchmark for Unified Spatio-Temporal Video Understanding in Traffic Scenes." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown

[Zhou et al. "TUMTraf VideoQA: Dataset and Benchmark for Unified Spatio-Temporal Video Understanding in Traffic Scenes." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/zhou2025icml-tumtraf/)

BibTeX

@inproceedings{zhou2025icml-tumtraf,
  title     = {{TUMTraf VideoQA: Dataset and Benchmark for Unified Spatio-Temporal Video Understanding in Traffic Scenes}},
  author    = {Zhou, Xingcheng and Larintzakis, Konstantinos and Guo, Hao and Zimmer, Walter and Liu, Mingyu and Cao, Hu and Zhang, Jiajie and Lakshminarasimhan, Venkatnarayanan and Strand, Leah and Knoll, Alois},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
  pages     = {78786--78811},
  volume    = {267},
  url       = {https://mlanthology.org/icml/2025/zhou2025icml-tumtraf/}
}