STEM-PoM: Evaluating Language Models Math-Symbol Reasoning in Document Parsing

Abstract

Advances in large language models (LLMs) have spurred research into enhancing their reasoning capabilities, particularly on math-rich STEM documents. While LLMs can generate equations or solve math-related queries, their ability to fully understand and interpret abstract mathematical symbols in long, math-rich documents remains limited. In this paper, we introduce STEM-PoM, a comprehensive benchmark dataset designed to evaluate LLMs' reasoning abilities on math symbols within contextual scientific text. The dataset, sourced from real-world ArXiv documents, contains over 2K math symbols classified into four main attributes (variables, constants, operators, and unit descriptors), with additional sub-attributes: scalar/vector/matrix for variables, and local/global/discipline-specific labels for both constants and operators. Our extensive experiments show that state-of-the-art LLMs achieve an average of 20-60% accuracy with in-context learning and 50-60% accuracy with fine-tuning, revealing a significant gap in their mathematical reasoning capabilities. STEM-PoM fuels future research on developing advanced Math-AI models that can robustly handle math symbols.
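
To make the task format concrete, below is a minimal Python sketch of what a STEM-PoM-style record and its main-attribute accuracy evaluation might look like. This is not the authors' released code or schema; the field names (symbol, context, main_attribute, sub_attribute) and the trivial baseline are illustrative assumptions based only on the abstract.

# Minimal sketch of a STEM-PoM-style evaluation loop (illustrative only).
# Field names and label sets are assumptions drawn from the abstract,
# not the authors' released dataset schema.
from dataclasses import dataclass

MAIN_ATTRIBUTES = {"variable", "constant", "operator", "unit_descriptor"}

@dataclass
class SymbolRecord:
    symbol: str          # the math symbol as it appears, e.g. "v"
    context: str         # surrounding document text the model must read
    main_attribute: str  # one of MAIN_ATTRIBUTES
    sub_attribute: str   # e.g. "vector" for a variable, "global" for a constant

def evaluate(records, predict):
    """Accuracy of a classifier predict(symbol, context) -> main_attribute."""
    correct = sum(
        predict(r.symbol, r.context) == r.main_attribute for r in records
    )
    return correct / len(records)

# Usage: a trivial baseline that always guesses "variable".
records = [
    SymbolRecord("v", "... the velocity v of the particle ...", "variable", "vector"),
    SymbolRecord("c", "... where c denotes the speed of light ...", "constant", "global"),
]
print(evaluate(records, lambda sym, ctx: "variable"))  # -> 0.5

In practice, predict would wrap an in-context-learning prompt or a fine-tuned classifier; the point of the sketch is only the record structure and the accuracy metric implied by the abstract.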

Cite

Text

Zou et al. "STEM-PoM: Evaluating Language Models Math-Symbol Reasoning in Document Parsing." NeurIPS 2024 Workshops: MATH-AI, 2024.

Markdown

[Zou et al. "STEM-PoM: Evaluating Language Models Math-Symbol Reasoning in Document Parsing." NeurIPS 2024 Workshops: MATH-AI, 2024.](https://mlanthology.org/neuripsw/2024/zou2024neuripsw-stempom/)

BibTeX

@inproceedings{zou2024neuripsw-stempom,
  title     = {{STEM-PoM: Evaluating Language Models Math-Symbol Reasoning in Document Parsing}},
  author    = {Zou, Jiaru and Wang, Qing and Thakur, Pratyush and Kani, Nickvash},
  booktitle = {NeurIPS 2024 Workshops: MATH-AI},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/zou2024neuripsw-stempom/}
}