IsarStep: A Benchmark for High-Level Mathematical Reasoning

Abstract

A well-defined benchmark is essential for measuring and accelerating research progress in machine learning. In this paper, we present a benchmark for high-level mathematical reasoning and study the reasoning capabilities of neural sequence-to-sequence models. We build a non-synthetic dataset from the largest repository of proofs written by human experts in a theorem prover. The dataset has broad coverage of undergraduate- and research-level theorems in mathematics and computer science. In the task we define, a model must fill in a missing intermediate proposition given the surrounding proof steps. This task provides a starting point for the long-term goal of having machines generate human-readable proofs automatically. Our experiments and analysis reveal that while the task is challenging, neural models can capture non-trivial mathematical reasoning. We further design a hierarchical transformer that outperforms the transformer baseline.
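
As a concrete illustration of the task, consider the schematic sketch below. It is not drawn from the paper: the IsarStep benchmark is built from Isabelle/Isar proofs, and this Lean 4 analogue with hypothetical names only mimics the task format. The model observes the surrounding proof steps and must synthesize the masked intermediate proposition, here the have step:

-- Schematic Lean 4 analogue of the IsarStep task (illustrative only;
-- the actual benchmark consists of Isabelle/Isar proofs).
example (a b : Nat) (h1 : a = b + 1) (h2 : b = 3) : a = 4 := by
  -- The surrounding context (h1, h2, and the final rewrite) is given;
  -- the intermediate proposition below is masked, and the model must
  -- propose it so that the proof goes through.
  have hmid : b + 1 = 4 := by rw [h2]  -- ← the masked proposition
  rw [h1, hmid]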

Cite

Text

Li et al. "IsarStep: A Benchmark for High-Level Mathematical Reasoning." International Conference on Learning Representations, 2021.

Markdown

[Li et al. "IsarStep: A Benchmark for High-Level Mathematical Reasoning." International Conference on Learning Representations, 2021.](https://mlanthology.org/iclr/2021/li2021iclr-isarstep/)

BibTeX

@inproceedings{li2021iclr-isarstep,
  title     = {{IsarStep: A Benchmark for High-Level Mathematical Reasoning}},
  author    = {Li, Wenda and Yu, Lei and Wu, Yuhuai and Paulson, Lawrence C.},
  booktitle = {International Conference on Learning Representations},
  year      = {2021},
  url       = {https://mlanthology.org/iclr/2021/li2021iclr-isarstep/}
}