TemporalBench: Benchmarking Fine-Grained Temporal Understanding for Multimodal Video Models

Abstract

Understanding fine-grained temporal dynamics is crucial for video understanding. Yet, popular video benchmarks such as MSRVTT and TGIF often fail to effectively evaluate AI models' temporal reasoning abilities due to the lack of fine-grained temporal annotations. As a result, text-based models, leveraging strong language priors, often perform comparably to video models, and image-trained models have been reported to outperform their video-trained counterparts on MSRVTT and TGIF. This paper introduces TemporalBench, a new benchmark for fine-grained temporal event understanding in videos. TemporalBench, sourced from a diverse set of video datasets, consists of ∼10K video question-answer pairs, derived from ∼2K high-quality human-annotated video captions. Uniquely, our benchmark provides fine-grained temporal annotations to evaluate models' temporal reasoning abilities. Our results show that state-of-the-art models like GPT-4o achieve only 38.0% multiple binary QA accuracy on TemporalBench, demonstrating a significant human-AI gap in temporal understanding. We hope that TemporalBench will be instrumental in fostering research on improving models' temporal reasoning capabilities.
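
A metric like multiple binary QA accuracy is typically stricter than per-question accuracy: it groups all binary (positive vs. negative caption) questions derived from the same video and credits the video only if every question in the group is answered correctly, which suppresses chance-level guessing from language priors. The following is a minimal Python sketch of that scoring rule under this assumption; the function name and input format are illustrative, not the authors' released evaluation code.

from collections import defaultdict

def multiple_binary_accuracy(predictions):
    """predictions: iterable of (video_id, is_correct), one entry per binary QA.

    Illustrative sketch: assumes the metric requires all binary questions
    for a video to be answered correctly for that video to count.
    """
    per_video = defaultdict(list)
    for video_id, is_correct in predictions:
        per_video[video_id].append(is_correct)
    # A video scores 1 only when all of its binary questions are correct.
    return sum(all(v) for v in per_video.values()) / len(per_video)

# Example: video "a" answers both questions correctly; video "b" misses one.
preds = [("a", True), ("a", True), ("b", True), ("b", False)]
print(multiple_binary_accuracy(preds))  # 0.5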

Cite

Text

Cai et al. "TemporalBench: Benchmarking Fine-Grained Temporal Understanding for Multimodal Video Models." NeurIPS 2024 Workshops: Video-Language_Models, 2024.

Markdown

[Cai et al. "TemporalBench: Benchmarking Fine-Grained Temporal Understanding for Multimodal Video Models." NeurIPS 2024 Workshops: Video-Language_Models, 2024.](https://mlanthology.org/neuripsw/2024/cai2024neuripsw-temporalbench/)

BibTeX

@inproceedings{cai2024neuripsw-temporalbench,
  title     = {{TemporalBench: Benchmarking Fine-Grained Temporal Understanding for Multimodal Video Models}},
  author    = {Cai, Mu and Tan, Reuben and Zhang, Jianrui and Zou, Bocheng and Zhang, Kai and Feng, Yao and Zhu, Fangrui and Gu, Jing and Zhong, Yiwu and Shang, Yuzhang and Dou, Yao and Park, Jaden and Gao, Jianfeng and Lee, Yong Jae and Yang, Jianwei},
  booktitle = {NeurIPS 2024 Workshops: Video-Language_Models},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/cai2024neuripsw-temporalbench/}
}