PrismBench: Dynamic and Flexible Benchmarking of LLMs Code Generation with Monte Carlo Tree Search
Abstract
The rapid advancement of LLMs' code generation capabilities is outpacing traditional evaluation methods. Static benchmarks fail to capture the depth and breadth of LLM capabilities and eventually become obsolete, while most dynamic approaches either rely too heavily on LLM-based evaluation or remain constrained by predefined test sets. To address these issues, we introduce PrismBench, a multi-agent, dynamic benchmarking framework designed to systematically expose and analyze LLM failure modes in code generation tasks. We formulate evaluation as a Markov Decision Process over a structured tree of coding challenges, leveraging a customized Monte Carlo Tree Search algorithm to traverse this tree and discover high-failure scenarios. Our multi-agent setup orchestrates task generation, model response, and analysis, enabling scalable assessment across diverse coding challenges. Additionally, we propose metrics that combine structural traversal patterns with performance across different tasks and difficulty levels to enable diagnostic and systematic comparison of LLMs' performance. We conduct extensive experiments on eight state-of-the-art LLMs and analyze how model architecture and scale influence code generation performance across varying coding tasks. All code, evaluation trees, and a public leaderboard are available at https://prismbench.github.io/Demo/
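The search idea in the abstract can be illustrated with a minimal sketch. It assumes plain UCB1 selection and a reward of 1.0 whenever the evaluated model fails a challenge, so that the search concentrates on high-failure regions of the tree. The names `Node`, `ucb1`, `select`, `backpropagate`, `search`, and the `evaluate` callback are hypothetical and are not taken from PrismBench, whose customized MCTS and reward design may differ.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical challenge-tree node: a coding concept at a difficulty level."""
    concept: str
    difficulty: int
    visits: int = 0
    total_reward: float = 0.0
    children: list["Node"] = field(default_factory=list)


def ucb1(child: Node, parent_visits: int, c: float = 1.4) -> float:
    """Standard UCB1 score; unvisited children are always tried first."""
    if child.visits == 0:
        return math.inf
    mean = child.total_reward / child.visits
    return mean + c * math.sqrt(math.log(parent_visits) / child.visits)


def select(root: Node) -> list[Node]:
    """Walk from the root to a leaf, taking the highest-UCB1 child at each step."""
    path = [root]
    while path[-1].children:
        parent = path[-1]
        path.append(max(parent.children, key=lambda ch: ucb1(ch, parent.visits)))
    return path


def backpropagate(path: list[Node], reward: float) -> None:
    """Add the observed reward to every node on the selected path."""
    for node in path:
        node.visits += 1
        node.total_reward += reward


def search(root: Node, evaluate, iterations: int) -> None:
    """Run the select / evaluate / backpropagate loop.

    `evaluate(node)` is an assumed callback returning 1.0 if the model under
    test FAILED the challenge at `node` and 0.0 if it passed.
    """
    for _ in range(iterations):
        path = select(root)
        backpropagate(path, evaluate(path[-1]))
```

For example, with a root that has one child the model always fails and one it always passes, the failing child accumulates most of the visits, which is the behavior a failure-seeking benchmark wants. Node expansion and the multi-agent task generation described in the abstract are omitted here.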
Cite
Text
Majdinasab et al. "PrismBench: Dynamic and Flexible Benchmarking of LLMs Code Generation with Monte Carlo Tree Search." Transactions on Machine Learning Research, 2026.
Markdown
[Majdinasab et al. "PrismBench: Dynamic and Flexible Benchmarking of LLMs Code Generation with Monte Carlo Tree Search." Transactions on Machine Learning Research, 2026.](https://mlanthology.org/tmlr/2026/majdinasab2026tmlr-prismbench/)
BibTeX
@article{majdinasab2026tmlr-prismbench,
title = {{PrismBench: Dynamic and Flexible Benchmarking of LLMs Code Generation with Monte Carlo Tree Search}},
author = {Majdinasab, Vahid and Nikanjam, Amin and Khomh, Foutse},
journal = {Transactions on Machine Learning Research},
year = {2026},
url = {https://mlanthology.org/tmlr/2026/majdinasab2026tmlr-prismbench/}
}