GTBench: Uncovering the Strategic Reasoning Capabilities of LLMs via Game-Theoretic Evaluations
Abstract
As Large Language Models (LLMs) are integrated into critical real-world applications, their strategic and logical reasoning abilities become increasingly crucial. This paper evaluates LLMs' reasoning abilities in competitive environments through game-theoretic tasks, e.g., board and card games that require pure logic and strategic reasoning to compete with opponents. We first propose GTBench, a language-driven environment comprising 10 widely recognized tasks across a comprehensive game taxonomy: complete versus incomplete information, dynamic versus static, and probabilistic versus deterministic scenarios. Then, we (1) characterize the game-theoretic reasoning of LLMs and (2) perform LLM-vs.-LLM competitions as a reasoning evaluation. We observe that (1) LLMs behave differently across gaming scenarios; for example, they fail in complete and deterministic games yet remain competitive in probabilistic gaming scenarios; and (2) most open-source LLMs, e.g., CodeLlama-34b-Instruct and Llama-2-70b-chat, are less competitive than commercial LLMs, e.g., GPT-4, in complex games, although the recently released Llama-3-70b-Instruct closes this gap. In addition, code pretraining greatly benefits strategic reasoning, while advanced reasoning methods such as Chain-of-Thought (CoT) and Tree-of-Thought (ToT) do not always help. We further characterize the game-theoretic properties of LLMs, such as equilibrium and Pareto efficiency in repeated games, and provide detailed error profiles for a better understanding of LLMs' behavior. We hope our research provides standardized protocols and serves as a foundation for further exploration of the strategic reasoning of LLMs.
Cite
Text
Duan et al. "GTBench: Uncovering the Strategic Reasoning Capabilities of LLMs via Game-Theoretic Evaluations." Neural Information Processing Systems, 2024. doi:10.52202/079017-0885
Markdown
[Duan et al. "GTBench: Uncovering the Strategic Reasoning Capabilities of LLMs via Game-Theoretic Evaluations." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/duan2024neurips-gtbench/) doi:10.52202/079017-0885
BibTeX
@inproceedings{duan2024neurips-gtbench,
title = {{GTBench: Uncovering the Strategic Reasoning Capabilities of LLMs via Game-Theoretic Evaluations}},
author = {Duan, Jinhao and Zhang, Renming and Diffenderfer, James and Kailkhura, Bhavya and Sun, Lichao and Stengel-Eskin, Elias and Bansal, Mohit and Chen, Tianlong and Xu, Kaidi},
booktitle = {Neural Information Processing Systems},
year = {2024},
doi = {10.52202/079017-0885},
url = {https://mlanthology.org/neurips/2024/duan2024neurips-gtbench/}
}