EffiBench-X: A Multi-Language Benchmark for Measuring Efficiency of LLM-Generated Code

Qing, Yuhao; Zhu, Boyu; Du, Mingzhe; Guo, Zhijiang; Zhuo, Terry Yue; Zhang, Qianru; Zhang, Jie M.; Cui, Heming; Yiu, Siu Ming; Huang, Dong; Ng, See-Kiong; Luu, Anh Tuan

NeurIPS 2025

Abstract

Existing code generation benchmarks primarily evaluate functional correctness, with limited attention to code efficiency, and they are often restricted to a single language such as Python. To address this gap, we introduce EffiBench-X, the first large-scale multi-language benchmark specifically designed for robust efficiency evaluation of LLM-generated code. EffiBench-X supports Python, C++, Java, JavaScript, Ruby, and Go, and comprises competitive programming tasks paired with human-expert solutions as efficiency baselines. Evaluating state-of-the-art LLMs on EffiBench-X reveals that while models frequently generate functionally correct code, they consistently underperform human experts in efficiency. Even the most efficient LLM-generated solutions (e.g., Qwen3-32B) achieve only around 62% of human efficiency on average, with significant language-specific variation: models tend to perform better in Python, Ruby, and JavaScript than in Java, C++, and Go (e.g., DeepSeek-R1's Python code is markedly more efficient than its Java code). These findings highlight the need for research into optimization-oriented methods to improve the efficiency of LLM-generated code across diverse languages. The dataset and evaluation infrastructure are publicly available at https://github.com/EffiBench/EffiBench-X.git and https://huggingface.co/datasets/EffiBench/effibench-x.
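To make a figure such as "around 62% of human efficiency" concrete, the sketch below shows one way a relative-efficiency score could be computed from paired runtimes. It is an illustration only: the abstract does not define the metric, so the ratio used here (human runtime divided by model runtime, averaged over tasks), the function names `relative_efficiency` and `mean_relative_efficiency`, and the sample timings are all assumptions, not the benchmark's actual method or data.

```python
from statistics import mean


def relative_efficiency(human_seconds: float, model_seconds: float) -> float:
    """Hypothetical per-task score: 1.0 means parity with the human baseline,
    values below 1.0 mean the model's solution is slower."""
    if human_seconds <= 0 or model_seconds <= 0:
        raise ValueError("runtimes must be positive")
    return human_seconds / model_seconds


def mean_relative_efficiency(pairs: list[tuple[float, float]]) -> float:
    """Average the per-task ratios over (human_seconds, model_seconds) pairs."""
    if not pairs:
        raise ValueError("at least one task is required")
    return mean(relative_efficiency(h, m) for h, m in pairs)


if __name__ == "__main__":
    # Made-up timings for two tasks, purely for illustration.
    timings = [(1.0, 2.0), (3.0, 4.0)]
    print(f"{mean_relative_efficiency(timings):.1%}")  # prints 62.5%
```

Under this assumed definition, a score of 62.5% means the model's solutions take, on a per-task average of ratios, noticeably longer than the human-expert baselines. A real evaluation would also need to handle incorrect solutions, memory usage, and timing noise, none of which this sketch addresses.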

Cite

Text

Qing et al. "EffiBench-X: A Multi-Language Benchmark for Measuring Efficiency of LLM-Generated Code." Advances in Neural Information Processing Systems, 2025.

Markdown

[Qing et al. "EffiBench-X: A Multi-Language Benchmark for Measuring Efficiency of LLM-Generated Code." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/qing2025neurips-effibenchx/)

BibTeX

@inproceedings{qing2025neurips-effibenchx,
  title     = {{EffiBench-X: A Multi-Language Benchmark for Measuring Efficiency of LLM-Generated Code}},
  author    = {Qing, Yuhao and Zhu, Boyu and Du, Mingzhe and Guo, Zhijiang and Zhuo, Terry Yue and Zhang, Qianru and Zhang, Jie M. and Cui, Heming and Yiu, Siu Ming and Huang, Dong and Ng, See-Kiong and Luu, Anh Tuan},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/qing2025neurips-effibenchx/}
}