TB-Bench: Training and Testing Multi-Modal AI for Understanding Spatio-Temporal Traffic Behaviors from Dashcam Images/Videos

Charoenpitaks, Korawat; Nguyen, Van-Quang; Suganuma, Masanori; Arai, Kentaro; Totsuka, Seiji; Ino, Hiroshi; Okatani, Takayuki

CVPRW 2025 pp. 2445-2455

Abstract

The application of Multi-modal Large Language Models (MLLMs) in Autonomous Driving (AD) faces significant challenges due to their limited training on traffic-specific data and the absence of dedicated benchmarks for spatiotemporal understanding. This study addresses these issues by proposing TB-Bench, a comprehensive benchmark designed to evaluate MLLMs on understanding traffic behaviors across eight perception tasks from ego-centric views. We also introduce vision-language instruction tuning datasets, TB-100k and TB-250k, along with simple yet effective baselines for the tasks. Through extensive experiments, we show that existing MLLMs underperform in these tasks, with even a powerful model like GPT-4o achieving less than 35% accuracy on average. In contrast, when fine-tuned with TB-100k or TB-250k, our baseline models achieve average accuracy up to 85%, significantly enhancing performance on the tasks. Additionally, we demonstrate performance transfer to another driving benchmark by co-training a model on the other driving benchmark dataset with our proposed dataset. The benchmark, datasets, and code will be available at https://github.com/TB-AD/TB-Bench.

Cite

Text

Charoenpitaks et al. "TB-Bench: Training and Testing Multi-Modal AI for Understanding Spatio-Temporal Traffic Behaviors from Dashcam Images/Videos." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2025.

Markdown

[Charoenpitaks et al. "TB-Bench: Training and Testing Multi-Modal AI for Understanding Spatio-Temporal Traffic Behaviors from Dashcam Images/Videos." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2025.](https://mlanthology.org/cvprw/2025/charoenpitaks2025cvprw-tbbench/)

BibTeX

@inproceedings{charoenpitaks2025cvprw-tbbench,
  title     = {{TB-Bench: Training and Testing Multi-Modal AI for Understanding Spatio-Temporal Traffic Behaviors from Dashcam Images/Videos}},
  author    = {Charoenpitaks, Korawat and Nguyen, Van-Quang and Suganuma, Masanori and Arai, Kentaro and Totsuka, Seiji and Ino, Hiroshi and Okatani, Takayuki},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2025},
  pages     = {2445--2455},
  url       = {https://mlanthology.org/cvprw/2025/charoenpitaks2025cvprw-tbbench/}
}