CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution

Abstract

We present CRUXEval (Code Reasoning, Understanding, and eXecution Evaluation), a benchmark consisting of 800 Python functions (3-13 lines). Each function comes with an input-output pair, leading to two natural tasks: input prediction and output prediction. First, we propose a general recipe for generating our execution benchmark by sampling from a model, which can be used to create more challenging versions of the benchmark if needed. Second, we evaluate twenty code models on our benchmark and discover that many recent high-scoring models on HumanEval show no improvements on our benchmark. Third, we show that simple chain-of-thought (CoT) and fine-tuning schemes can improve performance on our benchmark but remain far from solving it. The best setup, GPT-4 with CoT, achieves a pass@1 of 75% and 81% on input and output prediction, respectively. In contrast, Code Llama 34B achieves a pass@1 of 50% and 46% on input and output prediction. When it comes to reasoning about code, GPT-4 has a huge edge over other models but still fails consistently on some surprisingly simple Python programs.
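To illustrate the two task formats, here is a toy sketch in the style of a CRUXEval problem. The function and assertions below are invented for illustration only and are not taken from the benchmark; the actual problems are model-generated 3-13 line functions paired with a concrete input-output example.

# Illustrative CRUXEval-style problem (toy example, not from the benchmark).
def f(s):
    # Collect, in order of first appearance, the characters that occur
    # more than once in the string.
    result = []
    for ch in s:
        if s.count(ch) > 1 and ch not in result:
            result.append(ch)
    return "".join(result)

# Output prediction: given f and an input, complete the assertion.
assert f("abracadabra") == "abr"

# Input prediction: given f and an output, find any input that satisfies it.
assert f("hello") == "l"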

Cite

Text

Gu et al. "CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution." ICLR 2024 Workshops: DPFM, 2024.

Markdown

[Gu et al. "CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution." ICLR 2024 Workshops: DPFM, 2024.](https://mlanthology.org/iclrw/2024/gu2024iclrw-cruxeval/)

BibTeX

@inproceedings{gu2024iclrw-cruxeval,
  title     = {{CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution}},
  author    = {Gu, Alex and Roziere, Baptiste and Leather, Hugh James and Solar-Lezama, Armando and Synnaeve, Gabriel and Wang, Sida},
  booktitle = {ICLR 2024 Workshops: DPFM},
  year      = {2024},
  url       = {https://mlanthology.org/iclrw/2024/gu2024iclrw-cruxeval/}
}