Advancing Theorem Proving in LLMs Through Large-Scale Synthetic Data

Abstract

Proof assistants like Lean have revolutionized mathematical proof verification by providing high levels of accuracy and reliability. Although large language models (LLMs) have demonstrated potential in mathematical reasoning, their advancement in formal theorem proving is hindered by the scarcity of large, high-quality training datasets. To address this challenge, we present a novel approach to generate extensive Lean 4 proof data from natural language mathematical problems at the high school and undergraduate levels. Specifically, we synthesize 8 million formal statements with corresponding proofs, leveraging this dataset to fine-tune the DeepSeekMath 7B model. The resulting model, DeepSeek-Prover, achieves a pass rate of 50% on the Lean 4 miniF2F benchmark, surpassing the previous state-of-the-art result of 41.0%. These findings underscore the potential of large-scale synthetic data in significantly enhancing the theorem-proving capabilities of LLMs.
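
To illustrate what a formal statement with its proof looks like in Lean 4, here is a minimal, hypothetical example in the style of the miniF2F benchmark. The theorem name, statement, and tactic choice are illustrative assumptions, not drawn from the paper's synthesized dataset:

import Mathlib

-- Hypothetical high-school-level statement: for every real x, x^2 + 1 is positive.
theorem sq_add_one_pos (x : ℝ) : x ^ 2 + 1 > 0 := by
  -- nlinarith closes the goal using the nonnegativity of x^2 as a hint.
  nlinarith [sq_nonneg x]

In the pipeline described by the abstract, a natural-language problem ("show that x² + 1 is positive for every real x") would be autoformalized into a statement like the one above, and a proof attempt would be checked by the Lean 4 kernel before being admitted as training data.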

Cite

Text

Xin et al. "Advancing Theorem Proving in LLMs Through Large-Scale Synthetic Data." NeurIPS 2024 Workshops: MATH-AI, 2024.

Markdown

[Xin et al. "Advancing Theorem Proving in LLMs Through Large-Scale Synthetic Data." NeurIPS 2024 Workshops: MATH-AI, 2024.](https://mlanthology.org/neuripsw/2024/xin2024neuripsw-advancing/)

BibTeX

@inproceedings{xin2024neuripsw-advancing,
  title     = {{Advancing Theorem Proving in LLMs Through Large-Scale Synthetic Data}},
  author    = {Xin, Huajian and Guo, Daya and Shao, Zhihong and Ren, Z.Z. and Zhu, Qihao and Liu, Bo and Ruan, Chong and Li, Wenda and Liang, Xiaodan},
  booktitle = {NeurIPS 2024 Workshops: MATH-AI},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/xin2024neuripsw-advancing/}
}