ThinkBench: Dynamic Out-of-Distribution Evaluation for Robust LLM Reasoning
Abstract
Evaluating large language models (LLMs) poses significant challenges, particularly due to data contamination and the leakage of correct answers. To address these challenges, we introduce ThinkBench, a novel evaluation framework designed to robustly evaluate the reasoning capability of LLMs. ThinkBench proposes a dynamic data generation method for constructing out-of-distribution (OOD) datasets and offers an OOD dataset containing 2,912 samples drawn from reasoning tasks. ThinkBench unifies the evaluation of reasoning and non-reasoning models. We evaluate 16 LLMs and 4 PRMs under identical experimental conditions and show that most LLMs' performance is far from robust and that they suffer a certain degree of data leakage. By dynamically generating OOD datasets, ThinkBench provides a reliable evaluation of LLMs and reduces the impact of data contamination. Our data and code are available at https://github.com/huangshulin123/ThinkBench.
Cite
Text
Huang et al. "ThinkBench: Dynamic Out-of-Distribution Evaluation for Robust LLM Reasoning." Advances in Neural Information Processing Systems, 2025.

Markdown
[Huang et al. "ThinkBench: Dynamic Out-of-Distribution Evaluation for Robust LLM Reasoning." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/huang2025neurips-thinkbench/)

BibTeX
@inproceedings{huang2025neurips-thinkbench,
  title     = {{ThinkBench: Dynamic Out-of-Distribution Evaluation for Robust LLM Reasoning}},
  author    = {Huang, Shulin and Yang, Linyi and Song, Yan and Chen, Shuang and Cui, Leyang and Wan, Ziyu and Zeng, Qingcheng and Wen, Ying and Shao, Kun and Zhang, Weinan and Wang, Jun and Zhang, Yue},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/huang2025neurips-thinkbench/}
}