MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs

Zhongshen Zeng, Yinhong Liu, Yingjia Wan, Jingyao Li, Pengguang Chen, Jianbo Dai, Yuxuan Yao, Rongwu Xu, Zehan Qi, Wanru Zhao, Linling Shen, Jianqiao Lu, Haochen Tan, Yukang Chen, Hao Zhang, Zhan Shi, Bailin Wang, Zhijiang Guo, Jiaya Jia

NeurIPS 2024

doi:10.52202/079017-3797

Abstract

Large language models (LLMs) have shown increasing capability in problem-solving and decision-making, largely through step-by-step chain-of-thought reasoning. However, evaluating these reasoning abilities has become increasingly challenging. Existing outcome-based benchmarks are beginning to saturate and are becoming less effective at tracking meaningful progress. To address this, we present MR-Ben, a process-based benchmark that demands a meta-reasoning skill: LLMs are asked to locate and analyse potential errors in automatically generated reasoning steps. Our meta-reasoning paradigm is especially suited for system-2 slow thinking, mirroring the human cognitive process of carefully examining assumptions, conditions, calculations, and logic to identify mistakes. MR-Ben comprises 5,975 questions curated by human experts across a wide range of subjects, including physics, chemistry, logic, coding, and more. Using metrics we designed for assessing meta-reasoning on this benchmark, we identify interesting limitations and weaknesses of current LLMs, both open-source and closed-source. For example, while models like the o1 series from OpenAI demonstrate strong performance by effectively scrutinizing the solution space, many other state-of-the-art models fall significantly behind on MR-Ben, exposing potential shortcomings in their training strategies and inference methodologies.

Cite

Text

Zeng et al. "MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs." Neural Information Processing Systems, 2024. doi:10.52202/079017-3797

Markdown

[Zeng et al. "MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/zeng2024neurips-mrben/) doi:10.52202/079017-3797

BibTeX

@inproceedings{zeng2024neurips-mrben,
  title     = {{MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs}},
  author    = {Zeng, Zhongshen and Liu, Yinhong and Wan, Yingjia and Li, Jingyao and Chen, Pengguang and Dai, Jianbo and Yao, Yuxuan and Xu, Rongwu and Qi, Zehan and Zhao, Wanru and Shen, Linling and Lu, Jianqiao and Tan, Haochen and Chen, Yukang and Zhang, Hao and Shi, Zhan and Wang, Bailin and Guo, Zhijiang and Jia, Jiaya},
  booktitle = {Neural Information Processing Systems},
  year      = {2024},
  doi       = {10.52202/079017-3797},
  url       = {https://mlanthology.org/neurips/2024/zeng2024neurips-mrben/}
}