MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation

Abstract

In this work, we introduce a novel evaluation paradigm for Large Language Models (LLMs) that compels them to transition from a traditional question-answering role, akin to a student, to a solution-scoring role, akin to a teacher. This paradigm, which we term meta-reasoning because it centers on "reasoning about reasoning," shifts the emphasis from result-oriented assessments, which often neglect the reasoning process, to a more comprehensive evaluation that effectively distinguishes the cognitive capabilities of different models. Our meta-reasoning process mirrors "system-2" slow thinking, requiring careful examination of assumptions, conditions, calculations, and logic to identify mistakes. This paradigm enables one to transform existing saturated, non-differentiating benchmarks that may have leaked into pretraining data into evaluation tools that are both challenging and robust against data contamination. To prove our point, we applied our paradigm to the GSM8K dataset and developed the MR-GSM8K benchmark. Our extensive analysis covers several state-of-the-art models from both open-source and commercial domains, uncovering fundamental deficiencies in their training and evaluation methodologies. Specifically, we found that the OpenAI o1 models, which exhibit characteristics of "system-2" thinking, outperform the other SOTA models by more than 20 absolute points on our benchmark, supporting our deficiency hypothesis.
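The student-to-teacher role reversal described above can be illustrated with a minimal scoring sketch: the judged model must both classify a candidate solution as correct or incorrect and, when incorrect, locate the first faulty step. The record fields and grading logic below are illustrative assumptions for exposition, not the benchmark's exact schema or metric.

```python
# Minimal sketch of the solution-scoring ("teacher") evaluation format.
# Field names ("solution_correct", "first_error_step") are hypothetical.

def grade_judgement(record, model_verdict):
    """Score a judge model's assessment of one student solution.

    record: ground-truth annotation with
      - "solution_correct": bool
      - "first_error_step": 1-indexed step of the first mistake, or None
    model_verdict: the same fields as produced by the judge model.
    Returns True only if the judgement is fully right.
    """
    # The judge must get the binary correct/incorrect call right.
    verdict_right = model_verdict["solution_correct"] == record["solution_correct"]
    # For flawed solutions, it must also pinpoint the first faulty step,
    # which is where careful "system-2" step-by-step checking is required.
    if record["solution_correct"]:
        step_right = True
    else:
        step_right = model_verdict["first_error_step"] == record["first_error_step"]
    return verdict_right and step_right

example = {"solution_correct": False, "first_error_step": 3}
print(grade_judgement(example, {"solution_correct": False, "first_error_step": 3}))   # True
print(grade_judgement(example, {"solution_correct": True, "first_error_step": None}))  # False
```

Requiring the error location, not just a correct/incorrect label, is what makes this harder to game than result-only accuracy: a judge cannot score well by guessing the verdict without actually tracing the reasoning.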

Cite

Text

Zeng et al. "MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation." International Conference on Learning Representations, 2025.

Markdown

[Zeng et al. "MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/zeng2025iclr-mrgsm8k/)

BibTeX

@inproceedings{zeng2025iclr-mrgsm8k,
  title     = {{MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation}},
  author    = {Zeng, Zhongshen and Chen, Pengguang and Liu, Shu and Jiang, Haiyun and Jia, Jiaya},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/zeng2025iclr-mrgsm8k/}
}