HARDMATH: A Benchmark Dataset for Challenging Problems in Applied Mathematics

Jingxuan Fan, Sarah Martinson, Erik Y. Wang, Kaylie Hausknecht, Jonah Brenner, Danxian Liu, Nianli Peng, Corey Wang, Michael Brenner

NeurIPSW 2024

Abstract

Advanced applied mathematics problems are not well-represented in existing benchmarking datasets used to evaluate Large Language Models (LLMs). To address this, we introduce **HARDMath**, the Harvard Approximate Reasoning Dataset for Mathematics—a dataset of 1,466 difficult problems inspired by Harvard University’s graduate course on asymptotic methods. The dataset contains a diverse set of challenging applied mathematics problems with worked solutions that employ various analytical approximation methods. Developing such solutions typically requires multiple modes of analysis—including mathematical reasoning, the use of computational tools, and subjective judgment—making this a challenging problem for LLMs. We establish a framework that auto-generates an arbitrarily large number of ‘hard’ applied mathematics problems with approximate analytical solutions that include validity checks against numerical ground-truths. We evaluate frontier LLMs on **HARDMath-mini**, a sub-sampled test set of 366 problems, as well as on 40 word problems formulated in applied science contexts. Even leading closed-source models like GPT-4 achieve only 43.8% overall accuracy with few-shot Chain-of-Thought prompting, and all models demonstrate significantly lower performance compared to results on existing mathematics benchmark datasets. We additionally conduct a detailed error analysis to gain insights into the failure cases of LLMs. These results demonstrate limitations of current LLM performance on advanced graduate-level asymptotic math problems and underscore the importance of datasets like **HARDMath** to advance mathematical abilities of LLMs.
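
As an illustration of what a "validity check against a numerical ground truth" can look like, the following is a minimal sketch and is not taken from the HARDMath codebase. It compares the leading-order Laplace approximation $\sqrt{\pi/x}$ of the integral $I(x)=\int_{-1}^{1} e^{-x(t^2+t^4)}\,dt$ with a Simpson's-rule evaluation. The choice of integral, the 5% tolerance, and the function names (`simpson`, `laplace_leading_order`, `relative_error`, `is_valid`) are all assumptions made for this example.

```python
import math


def simpson(f, a, b, n=2000):
    """Composite Simpson's rule on [a, b]; n is rounded up to an even number."""
    if n % 2:
        n += 1
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        # Interior points alternate between weights 4 (odd i) and 2 (even i).
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


def laplace_leading_order(x):
    """Leading-order large-x approximation: only the t^2 term near t = 0 is kept."""
    return math.sqrt(math.pi / x)


def relative_error(x):
    """Relative error of the approximation against the numerical value (hypothetical check)."""
    numeric = simpson(lambda t: math.exp(-x * (t**2 + t**4)), -1.0, 1.0)
    return abs(laplace_leading_order(x) - numeric) / numeric


def is_valid(x, tol=0.05):
    """Accept the approximation only if it is within tol of the numerical value."""
    return relative_error(x) < tol


if __name__ == "__main__":
    for x in (1.0, 10.0, 100.0):
        print(x, relative_error(x), is_valid(x))
```

In this sketch the approximation is rejected at small $x$ and accepted at large $x$, which is the expected behavior for an asymptotic result: the check identifies the regime in which the analytical formula can be trusted.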

Cite

Text

Fan et al. "HARDMATH: A Benchmark Dataset for Challenging Problems in Applied Mathematics." NeurIPS 2024 Workshops: MATH-AI, 2024.

Markdown

[Fan et al. "HARDMATH: A Benchmark Dataset for Challenging Problems in Applied Mathematics." NeurIPS 2024 Workshops: MATH-AI, 2024.](https://mlanthology.org/neuripsw/2024/fan2024neuripsw-hardmath/)

BibTeX

@inproceedings{fan2024neuripsw-hardmath,
  title     = {{HARDMATH: A Benchmark Dataset for Challenging Problems in Applied Mathematics}},
  author    = {Fan, Jingxuan and Martinson, Sarah and Wang, Erik Y. and Hausknecht, Kaylie and Brenner, Jonah and Liu, Danxian and Peng, Nianli and Wang, Corey and Brenner, Michael},
  booktitle = {NeurIPS 2024 Workshops: MATH-AI},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/fan2024neuripsw-hardmath/}
}