SokoBench: Evaluating Long-Horizon Planning and Reasoning in Large Language Models

Abstract

Although Large Language Models (LLMs) and Large Reasoning Models (LRMs) are increasingly tested on complex reasoning tasks, their long-horizon planning abilities have not yet been extensively investigated. In this work, we provide a systematic assessment of the planning and long-horizon reasoning capabilities of state-of-the-art LRMs. We propose a novel benchmark based on Sokoban puzzles, intentionally simplified to isolate long-horizon planning from state persistence. Our findings reveal a consistent degradation in planning performance when more than 25 moves are required to reach the solution, suggesting non-recoverable error accumulation under single-pass autoregressive decoding. We show that equipping LRMs with Planning Domain Definition Language (PDDL) parsing, validation, and solving tools yields modest improvements, suggesting that character-level counting and long yet simple state tracking might not be overcome by test-time scaling approaches alone.
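To make the kind of state tracking discussed above concrete, the following is a minimal sketch of how a proposed move sequence could be checked against a Sokoban level. It is not the benchmark's implementation: the abstract does not specify the level encoding or the move alphabet, so the conventional Sokoban symbols (`#` wall, `@` player, `$` box, `.` goal, `*` box on goal, `+` player on goal), the `U`/`D`/`L`/`R` move letters, and the names `parse_level`, `check_plan`, and `CORRIDOR` are all assumptions introduced for illustration.

```python
# Hypothetical sketch: standard Sokoban rules, not the SokoBench code.
MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}


def parse_level(rows):
    """Turn a list of text rows into sets of wall, box, and goal cells plus the player cell."""
    walls, boxes, goals, player = set(), set(), set(), None
    for r, row in enumerate(rows):
        for c, ch in enumerate(row):
            if ch == "#":
                walls.add((r, c))
            if ch in "$*":
                boxes.add((r, c))
            if ch in ".*+":
                goals.add((r, c))
            if ch in "@+":
                player = (r, c)
    return walls, boxes, goals, player


def check_plan(rows, plan):
    """Replay `plan` move by move.

    Returns (solved, bad_index): bad_index is the position of the first
    illegal move, or None if every move was legal.
    """
    walls, boxes, goals, player = parse_level(rows)
    for i, move in enumerate(plan):
        dr, dc = MOVES[move]
        nxt = (player[0] + dr, player[1] + dc)
        if nxt in walls:
            return False, i  # walked into a wall
        if nxt in boxes:
            beyond = (nxt[0] + dr, nxt[1] + dc)
            if beyond in walls or beyond in boxes:
                return False, i  # box is blocked and cannot be pushed
            boxes.remove(nxt)
            boxes.add(beyond)
        player = nxt
    return boxes == goals, None


# An assumed one-box corridor level, chosen only as a small example.
CORRIDOR = [
    "#######",
    "#@$  .#",
    "#######",
]
```

In this sketch, `check_plan(CORRIDOR, "RRR")` pushes the box onto the goal, while one extra `R` is rejected at index 3 because the box would be pushed into a wall. A single miscounted step is enough to invalidate an otherwise correct plan, which is the failure mode the abstract attributes to long yet simple state tracking.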

Cite

Text

Monti et al. "SokoBench: Evaluating Long-Horizon Planning and Reasoning in Large Language Models." Transactions on Machine Learning Research, 2026.

Markdown

[Monti et al. "SokoBench: Evaluating Long-Horizon Planning and Reasoning in Large Language Models." Transactions on Machine Learning Research, 2026.](https://mlanthology.org/tmlr/2026/monti2026tmlr-sokobench/)

BibTeX

@article{monti2026tmlr-sokobench,
  title     = {{SokoBench: Evaluating Long-Horizon Planning and Reasoning in Large Language Models}},
  author    = {Monti, Sebastiano and Nicolini, Carlo and Pellegrini, Giovanni and Staiano, Jacopo and Lepri, Bruno},
  journal   = {Transactions on Machine Learning Research},
  year      = {2026},
  url       = {https://mlanthology.org/tmlr/2026/monti2026tmlr-sokobench/}
}