TimE: A Multi-Level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios

Abstract

Temporal reasoning is pivotal for Large Language Models (LLMs) to comprehend the real world. However, existing work neglects three real-world challenges for temporal reasoning: (1) intensive temporal information, (2) fast-changing event dynamics, and (3) complex temporal dependencies in social interactions. To bridge this gap, we propose TimE, a multi-level benchmark designed for temporal reasoning in real-world scenarios. TimE consists of 38,522 QA pairs, covering 3 levels with 11 fine-grained sub-tasks. The benchmark encompasses 3 sub-datasets, each reflecting a different real-world challenge: TimE-Wiki, TimE-News, and TimE-Dial. We conduct extensive experiments on both reasoning and non-reasoning models, provide an in-depth analysis of temporal reasoning performance across diverse real-world scenarios and tasks, and summarize the impact of test-time scaling on temporal reasoning capabilities. Additionally, we release TimE-Lite, a human-annotated subset, to foster future research and standardized evaluation in temporal reasoning.

Cite

Text

Wei et al. "TimE: A Multi-Level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios." Advances in Neural Information Processing Systems, 2025.

Markdown

[Wei et al. "TimE: A Multi-Level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/wei2025neurips-time/)

BibTeX

@inproceedings{wei2025neurips-time,
  title     = {{TimE: A Multi-Level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios}},
  author    = {Wei, Shaohang and Li, Wei and Song, Feifan and Luo, Wen and Zhuang, Tianyi and Tan, Haochen and Guo, Zhijiang and Wang, Houfeng},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/wei2025neurips-time/}
}