Not All LLM Reasoners Are Created Equal
Abstract
We study the depth of problem-solving capabilities of LLMs, and to what extent they perform mathematical reasoning in a compositional manner. To this end, we create a new benchmark by composing pairs of existing math word problems so that the answer to the second problem depends on correctly answering the first. We define the reasoning gap of a model as the difference between its performance when solving each question independently and its performance on the compositional pairs. Our findings reveal a significant reasoning gap in most frontier LLMs. This gap is more pronounced in smaller and more cost-efficient models. The objective of this study is not to introduce yet another benchmark, but rather to provide a case study aimed at gaining deeper insights into current models' reasoning abilities, and to reassess established training methods and benchmarks.
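As a rough sketch of the measure described above (the notation here is illustrative, not taken from the paper), the reasoning gap can be read as

gap = Acc_independent − Acc_compositional,

where Acc_independent is the model's accuracy when the two questions in a pair are posed separately, and Acc_compositional is its accuracy on the chained pair in which the second answer depends on the first.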
Cite
Text
Hosseini et al. "Not All LLM Reasoners Are Created Equal." NeurIPS 2024 Workshops: MATH-AI, 2024.

Markdown
[Hosseini et al. "Not All LLM Reasoners Are Created Equal." NeurIPS 2024 Workshops: MATH-AI, 2024.](https://mlanthology.org/neuripsw/2024/hosseini2024neuripsw-all/)

BibTeX
@inproceedings{hosseini2024neuripsw-all,
  title     = {{Not All LLM Reasoners Are Created Equal}},
  author    = {Hosseini, Arian and Sordoni, Alessandro and Toyama, Daniel Kenji and Courville, Aaron and Agarwal, Rishabh},
  booktitle = {NeurIPS 2024 Workshops: MATH-AI},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/hosseini2024neuripsw-all/}
}