Evaluating LLM Reasoning in the Operations Research Domain with ORQA

Abstract

In this paper, we introduce and apply Operations Research Question Answering (ORQA), a new benchmark, to assess the generalization capabilities of Large Language Models (LLMs) in the specialized technical domain of Operations Research (OR). This benchmark is designed to evaluate whether LLMs can emulate the knowledge and reasoning skills of OR experts when given diverse and complex optimization problems. The dataset, crafted by OR experts, presents real-world optimization problems that require multistep reasoning to construct their mathematical models. Our evaluations of various open-source LLMs, such as LLaMA 3.1, DeepSeek, and Mixtral, reveal modest performance, indicating a gap in their ability to generalize to specialized technical domains. This work contributes to the ongoing discourse on LLMs' generalization capabilities, providing insights for future research in this area. The dataset and evaluation code are publicly available.

Cite

Text

Mostajabdaveh et al. "Evaluating LLM Reasoning in the Operations Research Domain with ORQA." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I23.34673

Markdown

[Mostajabdaveh et al. "Evaluating LLM Reasoning in the Operations Research Domain with ORQA." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/mostajabdaveh2025aaai-evaluating/) doi:10.1609/AAAI.V39I23.34673

BibTeX

@inproceedings{mostajabdaveh2025aaai-evaluating,
  title     = {{Evaluating LLM Reasoning in the Operations Research Domain with ORQA}},
  author    = {Mostajabdaveh, Mahdi and Yu, Timothy Tin Long and Dash, Samarendra Chandan Bindu and Ramamonjison, Rindra and Byusa, Jabo Serge and Carenini, Giuseppe and Zhou, Zirui and Zhang, Yong},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {24902--24910},
  doi       = {10.1609/AAAI.V39I23.34673},
  url       = {https://mlanthology.org/aaai/2025/mostajabdaveh2025aaai-evaluating/}
}