GameBench: Evaluating Strategic Reasoning Abilities of LLM Agents

Abstract

Large language models (LLMs) have demonstrated remarkable few-shot performance on many natural language understanding tasks. Despite several demonstrations of using large language models in complex, strategic scenarios, there is no comprehensive framework for evaluating agents' performance across the various types of reasoning found in games. To address this gap, we introduce GameBench, a cross-domain benchmark for evaluating the strategic reasoning abilities of LLM agents. We focus on 9 different game environments, each of which covers at least one axis of key reasoning skills identified in strategy games, and we select games for which strategy explanations are unlikely to form a significant portion of models' pretraining corpora. Our evaluations use GPT-3.5 and GPT-4 in their base form along with two scaffolding frameworks designed to enhance strategic reasoning ability: Chain-of-Thought (CoT) prompting and Reasoning Via Planning (RAP). Our results show that none of the tested models match human performance, and at worst, GPT-4 performs worse than random action. CoT and RAP both improve scores but not to human-comparable levels. Benchmark code is available at https://anonymous.4open.science/r/GameBench-5942/.

Cite

Text

Costarelli et al. "GameBench: Evaluating Strategic Reasoning Abilities of LLM Agents." NeurIPS 2024 Workshops: LanGame, 2024.

Markdown

[Costarelli et al. "GameBench: Evaluating Strategic Reasoning Abilities of LLM Agents." NeurIPS 2024 Workshops: LanGame, 2024.](https://mlanthology.org/neuripsw/2024/costarelli2024neuripsw-gamebench/)

BibTeX

@inproceedings{costarelli2024neuripsw-gamebench,
  title     = {{GameBench: Evaluating Strategic Reasoning Abilities of LLM Agents}},
  author    = {Costarelli, Anthony and Allen, Mat and Hauksson, Roman and Sodunke, Grace and Hariharan, Suhas and Cheng, Carlson and Li, Wenjie and Clymer, Joshua M and Yadav, Arjun},
  booktitle = {NeurIPS 2024 Workshops: LanGame},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/costarelli2024neuripsw-gamebench/}
}