PrismBench: Dynamic and Flexible Benchmarking of LLMs Code Generation with Monte Carlo Tree Search

Majdinasab, Vahid; Nikanjam, Amin; Khomh, Foutse

TMLR 2026

Abstract

The rapid advancement of LLMs' code generation capabilities is outpacing traditional evaluation methods. Static benchmarks fail to capture the depth and breadth of LLM capabilities and eventually become obsolete, while most dynamic approaches either rely too heavily on LLM-based evaluation or remain constrained by predefined test sets. To address these issues, we introduce PrismBench, a multi-agent, dynamic benchmarking framework designed to systematically expose and analyze LLM failure modes in code generation tasks. We formulate evaluation as a Markov Decision Process over a structured tree of coding challenges, leveraging a customized Monte Carlo Tree Search algorithm to traverse this tree and discover high-failure scenarios. Our multi-agent setup orchestrates task generation, model response, and analysis, enabling scalable assessment across diverse coding challenges. Additionally, we propose metrics that combine structural traversal patterns with performance across different tasks and difficulty levels to enable diagnostic and systematic comparison of LLMs' performance. We conduct extensive experiments on eight state-of-the-art LLMs and analyze how model architecture and scale influence code generation performance across varying coding tasks. All code, evaluation trees, and a public leaderboard are available at https://prismbench.github.io/Demo/
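As a rough illustration of the tree-search idea the abstract describes, the following is a minimal sketch of textbook UCB1-based node selection and backpropagation over a tree of coding challenges. It is not PrismBench's customized algorithm: the `Node` structure, the exploration constant `c`, and the assumption that reward is higher when the evaluated model fails are all hypothetical choices made for this example.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """One coding challenge in the tree (hypothetical structure)."""
    name: str
    visits: int = 0
    total_reward: float = 0.0  # assumed: higher reward means more model failures
    children: list["Node"] = field(default_factory=list)


def ucb1(child: Node, parent_visits: int, c: float = 1.4) -> float:
    """Textbook UCB1 score: mean reward plus an exploration bonus."""
    if child.visits == 0:
        return math.inf  # unvisited challenges are always tried first
    mean = child.total_reward / child.visits
    return mean + c * math.sqrt(math.log(parent_visits) / child.visits)


def select_child(parent: Node) -> Node:
    """Return the child with the highest UCB1 score."""
    return max(parent.children, key=lambda ch: ucb1(ch, parent.visits))


def backpropagate(path: list[Node], reward: float) -> None:
    """Add one visit and the observed reward to every node on the path."""
    for node in path:
        node.visits += 1
        node.total_reward += reward
```

Under these assumptions, repeatedly calling `select_child` from the root and then `backpropagate` with the observed reward steers the search toward challenges where failures are frequent, while the exploration bonus keeps rarely visited challenges from being ignored.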

Cite

Text

Majdinasab et al. "PrismBench: Dynamic and Flexible Benchmarking of LLMs Code Generation with Monte Carlo Tree Search." Transactions on Machine Learning Research, 2026.

Markdown

[Majdinasab et al. "PrismBench: Dynamic and Flexible Benchmarking of LLMs Code Generation with Monte Carlo Tree Search." Transactions on Machine Learning Research, 2026.](https://mlanthology.org/tmlr/2026/majdinasab2026tmlr-prismbench/)

BibTeX

@article{majdinasab2026tmlr-prismbench,
  title     = {{PrismBench: Dynamic and Flexible Benchmarking of LLMs Code Generation with Monte Carlo Tree Search}},
  author    = {Majdinasab, Vahid and Nikanjam, Amin and Khomh, Foutse},
  journal   = {Transactions on Machine Learning Research},
  year      = {2026},
  url       = {https://mlanthology.org/tmlr/2026/majdinasab2026tmlr-prismbench/}
}