Lmgame-Bench: How Good Are LLMs at Playing Games?

Abstract

Playing video games requires perception, reasoning, memory, and long-horizon planning, which are exactly the faculties expected of modern large language and vision–language models (LLMs/VLMs). We introduce LMGame-Bench, a benchmark built on six popular games spanning the platformer, puzzle, and narrative genres, all exposed through a unified Gym-style API. Unlike prior game benchmarks that entangle multiple skills, LMGame-Bench employs a modular harness whose perception, memory, and reasoning modules can be toggled individually to probe distinct capabilities. The benchmark further improves robustness through prompt standardization and contamination mitigation. An evaluation of 13 state-of-the-art models shows that LMGame-Bench remains challenging yet discriminates effectively among models. Correlation analysis reveals that individual games align with core LLM capabilities, providing a quantitative framework for interpreting performance. Finally, LMGame-Bench exposes models' limitations in visual state extraction, reflection, spatiotemporal reasoning, and long-context reasoning, pointing to concrete directions for model improvement.
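To make the idea of a toggleable harness around a Gym-style loop concrete, here is a minimal sketch. It is not the benchmark's actual API: `HarnessConfig`, `CountdownEnv`, `build_prompt`, and `run_episode` are hypothetical names invented for illustration, the toy environment stands in for a real game, and the `policy` callable stands in for a model call.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class HarnessConfig:
    """Hypothetical switches for the three harness modules named in the abstract."""
    perception: bool = True
    memory: bool = True
    reasoning: bool = True


class CountdownEnv:
    """Toy stand-in for a game: land exactly on 0 by subtracting 1 or 2 per step."""

    def __init__(self, start: int = 5) -> None:
        self.start = start
        self.state = start

    def reset(self) -> int:
        self.state = self.start
        return self.state

    def step(self, action: int) -> Tuple[int, float, bool]:
        # Gym-style return value: (observation, reward, done).
        self.state -= action
        done = self.state <= 0
        reward = 1.0 if self.state == 0 else 0.0
        return self.state, reward, done


def build_prompt(obs: int, history: List[str], cfg: HarnessConfig) -> str:
    """Assemble the model input from whichever modules are enabled."""
    parts = []
    if cfg.perception:
        # Perception module: turn the raw observation into a textual description.
        parts.append(f"state: {obs} steps remain")
    else:
        parts.append(f"raw: {obs}")
    if cfg.memory and history:
        # Memory module: include the three most recent transitions.
        parts.append("history: " + "; ".join(history[-3:]))
    if cfg.reasoning:
        # Reasoning module: ask for explicit deliberation before the action.
        parts.append("think step by step before answering")
    return "\n".join(parts)


def run_episode(env: CountdownEnv, policy: Callable[[str], int], cfg: HarnessConfig) -> float:
    """Run one episode and return the total reward."""
    obs = env.reset()
    history: List[str] = []
    total = 0.0
    done = False
    while not done:
        action = policy(build_prompt(obs, history, cfg))
        obs, reward, done = env.step(action)
        history.append(f"took {action}, now {obs}")
        total += reward
    return total


if __name__ == "__main__":
    # A fixed policy replaces the model; compare the full harness to one with memory off.
    for cfg in (HarnessConfig(), HarnessConfig(memory=False)):
        print(cfg, run_episode(CountdownEnv(5), lambda prompt: 1, cfg))
```

Running the same policy under different `HarnessConfig` settings is the point of the sketch: any score difference between configurations can be attributed to the module that was switched off, which is the kind of selective probing the abstract describes.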

Cite

Text

Hu et al. "Lmgame-Bench: How Good Are LLMs at Playing Games?" International Conference on Learning Representations, 2026.

Markdown

[Hu et al. "Lmgame-Bench: How Good Are LLMs at Playing Games?" International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/hu2026iclr-lmgamebench/)

BibTeX

@inproceedings{hu2026iclr-lmgamebench,
  title     = {{Lmgame-Bench: How Good Are LLMs at Playing Games?}},
  author    = {Hu, Lanxiang and Huo, Mingjia and Zhang, Yuxuan and Yu, Haoyang and Xing, Eric P. and Stoica, Ion and Rosing, Tajana and Jin, Haojian and Zhang, Hao},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/hu2026iclr-lmgamebench/}
}