Code2Bench: Scaling Source and Rigor for Dynamic Benchmark Construction

Abstract

The evaluation of code-generating Large Language Models (LLMs) is fundamentally constrained by two intertwined challenges: a reliance on static, easily contaminated problem sources and the use of superficial, low-rigor testing. This paper introduces a new benchmark construction philosophy, Dual Scaling, designed to systematically address both limitations. Our approach involves continuously scaling the source of problems from dynamic, real-world code repositories and systematically scaling the rigor of tests via automated, high-coverage Property-Based Testing (PBT). We instantiate this philosophy in CODE2BENCH, an end-to-end framework that leverages Scope Graph analysis for principled dependency classification and a 100% branch coverage quality gate to ensure test suite integrity. Using this framework, we construct CODE2BENCH-2509, a new benchmark suite with native instances in both Python and Java. Our extensive evaluation of 10 state-of-the-art LLMs on CODE2BENCH-2509, powered by a novel "diagnostic fingerprint" visualization, yields three key insights: (1) models exhibit a fundamental performance gap, excelling at API application (Weakly Self-Contained tasks) but struggling with algorithmic synthesis (Self-Contained tasks); (2) a model’s performance is profoundly shaped by the target language’s ecosystem, a nuance we are the first to systematically quantify; and (3) our rigorous, scaled testing is critical in uncovering an "illusion of correctness" prevalent in simpler benchmarks. Our work presents a robust, scalable, and diagnostic paradigm for the next generation of LLM evaluation in software engineering. The code, data, and results are available at https://code2bench.github.io/.
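To make the "scaling the rigor of tests" idea concrete, the following is a minimal sketch of property-based differential testing, written with only the Python standard library. It is not the CODE2BENCH implementation: the function names (`reference_dedupe`, `buggy_dedupe`, `random_int_list`, `find_counterexample`) and the deduplication task itself are hypothetical and chosen purely for illustration. The sketch assumes a ground-truth function is available and checks a candidate against it on many randomly generated inputs instead of a few hand-written cases.

```python
import random


def reference_dedupe(items):
    """Hypothetical ground-truth function: drop duplicates, keep first-seen order."""
    seen = set()
    out = []
    for x in items:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out


def buggy_dedupe(items):
    """Hypothetical candidate that looks plausible but loses the original order."""
    return sorted(set(items))


def random_int_list(rng, max_len=8, lo=-5, hi=5):
    """Generate a short random list of small integers (duplicates are likely)."""
    return [rng.randint(lo, hi) for _ in range(rng.randint(0, max_len))]


def find_counterexample(candidate, reference, trials=500, seed=0):
    """Return the first random input on which candidate and reference disagree.

    Returns None if no disagreement is found within the given number of trials.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        case = random_int_list(rng)
        # Pass copies so neither function can mutate the other's input.
        if candidate(list(case)) != reference(list(case)):
            return case
    return None


if __name__ == "__main__":
    # A single hand-picked case passes for the buggy candidate ...
    print(buggy_dedupe([1, 2, 2, 3]) == reference_dedupe([1, 2, 2, 3]))
    # ... while randomized inputs expose the disagreement.
    print(find_counterexample(buggy_dedupe, reference_dedupe))
```

The hand-picked input happens to be sorted already, so the flawed candidate agrees with the reference on it; this is a small-scale version of the "illusion of correctness" the abstract describes. Randomized generation makes it likely that an unsorted input appears and reveals the bug. A real pipeline would additionally need a coverage check on the reference function, which this sketch omits.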

Cite

Text

Zhang et al. "Code2Bench: Scaling Source and Rigor for Dynamic Benchmark Construction." International Conference on Learning Representations, 2026.

Markdown

[Zhang et al. "Code2Bench: Scaling Source and Rigor for Dynamic Benchmark Construction." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/zhang2026iclr-code2bench/)

BibTeX

@inproceedings{zhang2026iclr-code2bench,
  title     = {{Code2Bench: Scaling Source and Rigor for Dynamic Benchmark Construction}},
  author    = {Zhang, Zhe and Liu, Runlin and Liu, Aishan and Liu, Xingyu and Gao, Xiang and Sun, Hailong},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/zhang2026iclr-code2bench/}
}