DafnyBench: A Benchmark for Formal Software Verification

Abstract

We introduce DafnyBench, the largest benchmark of its kind for training and evaluating machine learning systems for formal software verification. We test the ability of LLMs such as GPT-4 and Claude 3 to auto-generate enough annotations for the Dafny formal verification engine to successfully verify over 750 programs with about 53,000 lines of code. The best model and prompting scheme achieved a 68% success rate, and we quantify how this rate improves when retrying with error message feedback and how it deteriorates with the amount of required code and annotations. We hope that DafnyBench will enable rapid improvements from this baseline as LLMs and verification techniques grow in quality.
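For readers unfamiliar with Dafny, the minimal sketch below (illustrative only; not drawn from the DafnyBench dataset, and the method name is hypothetical) shows the kind of annotation the benchmark asks models to supply. Without the invariant line, Dafny cannot prove the ensures clause; with it, verification succeeds.

// Hypothetical example, not from the benchmark itself.
method CountUpTo(n: nat) returns (total: nat)
  ensures total == n
{
  total := 0;
  var i := 0;
  while i < n
    invariant total == i && i <= n  // the annotation a model must generate
  {
    total := total + 1;
    i := i + 1;
  }
}

Loop invariants are the canonical case: Dafny checks them automatically but generally cannot infer them, which is precisely the gap the LLMs in this benchmark are asked to fill.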

Cite

Text

Loughridge et al. "DafnyBench: A Benchmark for Formal Software Verification." Transactions on Machine Learning Research, 2025.

Markdown

[Loughridge et al. "DafnyBench: A Benchmark for Formal Software Verification." Transactions on Machine Learning Research, 2025.](https://mlanthology.org/tmlr/2025/loughridge2025tmlr-dafnybench/)

BibTeX

@article{loughridge2025tmlr-dafnybench,
  title     = {{DafnyBench: A Benchmark for Formal Software Verification}},
  author    = {Loughridge, Chloe R and Sun, Qinyi and Ahrenbach, Seth and Cassano, Federico and Sun, Chuyue and Sheng, Ying and Mudide, Anish and Misu, Md Rakib Hossain and Amin, Nada and Tegmark, Max},
  journal   = {Transactions on Machine Learning Research},
  year      = {2025},
  url       = {https://mlanthology.org/tmlr/2025/loughridge2025tmlr-dafnybench/}
}