HARDMath2: A Benchmark for Applied Mathematics Built by Students as Part of a Graduate Class
Abstract
Large language models (LLMs) have shown remarkable progress in mathematical problem-solving, but evaluation has largely focused on problems with exact analytical solutions or formal proofs, often overlooking the approximation-based problems that are ubiquitous in applied science and engineering. To fill this gap, we build on prior work and present $\textbf{HARDMath2}$, a dataset of 211 original problems covering the core topics of an introductory graduate applied mathematics class, including boundary-layer analysis, WKB methods, asymptotic solutions of nonlinear partial differential equations, and the asymptotics of oscillatory integrals. The dataset was designed and verified by the students and instructors of a core graduate applied mathematics course at Harvard. We build the dataset through a novel collaborative environment that challenges students to write and refine difficult problems consistent with the class syllabus, peer-validate solutions, test different models, and automatically check LLM-generated solutions against their own answers and numerical ground truths. Evaluation results show that leading frontier models still struggle with many of the problems in the dataset, highlighting a gap in the mathematical reasoning abilities of current LLMs. Importantly, students identified strategies for creating increasingly difficult problems by interacting with the models and exploiting common failure modes. This back-and-forth with the models not only produced a richer and more challenging benchmark but also led to qualitative improvements in the students' understanding of the course material, which is increasingly important as state-of-the-art language models become able to solve challenging problems across a wide range of fields.
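To illustrate the kind of automated check the abstract describes, the sketch below compares a candidate leading-order asymptotic answer against a numerically evaluated ground truth via relative error. This is a minimal illustration, not the authors' actual pipeline: the example integral, the stationary-phase formula, and the evaluation points are assumptions chosen for clarity.

```python
# Minimal sketch (NOT the paper's actual grading pipeline) of checking an
# LLM-generated asymptotic answer against a numerical ground truth.
# Illustrative problem: I(x) = ∫₀¹ cos(x t²) dt as x → ∞, whose
# stationary-phase leading order is (1/2)·sqrt(π/(2x)).
import numpy as np
from scipy.integrate import quad

def candidate_answer(x: float) -> float:
    # Hypothetical LLM-provided leading-order asymptotic formula.
    return 0.5 * np.sqrt(np.pi / (2.0 * x))

def ground_truth(x: float) -> float:
    # Direct numerical evaluation of the oscillatory integral.
    val, _ = quad(lambda t: np.cos(x * t**2), 0.0, 1.0, limit=200)
    return val

# A correct leading-order asymptotic should show shrinking relative error
# as x grows; a wrong formula typically will not.
for x in (50.0, 200.0, 800.0):
    rel_err = abs(candidate_answer(x) - ground_truth(x)) / abs(ground_truth(x))
    print(f"x = {x:6.0f}: relative error = {rel_err:.3e}")
```

In a grading setting, one would accept the answer if the relative error falls below a tolerance that tightens appropriately with x; the tolerance schedule here is left unspecified because it depends on the problem's expected error term.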
Cite
Text
Roggeveen et al. "HARDMath2: A Benchmark for Applied Mathematics Built by Students as Part of a Graduate Class." Advances in Neural Information Processing Systems, 2025.
Markdown
[Roggeveen et al. "HARDMath2: A Benchmark for Applied Mathematics Built by Students as Part of a Graduate Class." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/roggeveen2025neurips-hardmath2/)
BibTeX
@inproceedings{roggeveen2025neurips-hardmath2,
title = {{HARDMath2: A Benchmark for Applied Mathematics Built by Students as Part of a Graduate Class}},
author = {Roggeveen, James V and Wang, Erik Y. and Ettel, David and Flintoft, Will and Donets, Peter and Ward, Raglan and Roman, Ahmed and Graf, Anton Marius and Dandavate, Siddharth and Williamson, Ava and Yeung, Felix and Migacz, Kacper K and Wang, Yijun and Bostan, Egemen and Nguyen, Duy Thuc and He, Zhe and Descoteaux, Marc L. and Mykland, Anne and Liu, Shida and Ponce, Jorge García and Zhu, Luke and Chen, Yuyang and Ivshina, Ekaterina S. and Fernandez, Miguel and Kim, Minjae and Gumbs, Kennan and Tan, Matthew Scott and Yang, Russell and Hoang, Mai and Brown, David and Silveira, Isabella A and Sykes, Lavon and Nageswaran, Arjun and Fredenberg, William and Chen, Yiming and Martin, Lucas and Tang, Yixing and Smith, Kelly Werker and Liao, Hongyu and Wilson, Logan G. and Cai, Alexander Dazhen and Nathwani, Lucy S. and Gutierrez, Nickholas and Biju, Andrea Elizabeth and Brenner, Michael},
booktitle = {Advances in Neural Information Processing Systems},
year = {2025},
url = {https://mlanthology.org/neurips/2025/roggeveen2025neurips-hardmath2/}
}