Multi-Language Diversity Benefits Autoformalization
Abstract
Autoformalization is the task of translating natural language materials into machine-verifiable formalisations. Progress in autoformalization research is hindered by the lack of a sizeable dataset of informal-formal pairs expressing the same essence. Existing methods tend to circumvent this challenge by manually curating small corpora or using few-shot learning with large language models, but they suffer from data scarcity and the difficulty of formal language acquisition. In this work, we create MMA, a large, flexible, multi-language, and multi-domain dataset of informal-formal pairs, by using a language model to translate in the reverse direction, that is, from formal mathematical statements into corresponding informal ones. Experiments show that language models fine-tuned on MMA produce statements acceptable with minimal corrections for up to 29-31% of problems on the miniF2F and ProofNet benchmarks, up from 0% with the base model. We demonstrate that fine-tuning on multi-language formal data results in more capable autoformalization models, even on single-language tasks.
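The data-generation idea described in the abstract can be sketched as follows: start from formal statements (for example, drawn from existing formal libraries), prompt a language model to informalize each one, and keep the resulting (informal, formal) pairs as training data for the opposite direction, i.e. autoformalization. The sketch below is purely illustrative; the `call_llm` placeholder, the prompt wording, and the example statements are assumptions, not the paper's actual pipeline.

```python
# Hypothetical sketch of reverse-translation data generation: informalize
# formal statements with an LLM and keep (informal, formal) pairs for
# fine-tuning an autoformalization model. `call_llm` is a placeholder for
# whatever model/API is actually used.

import json
from typing import Callable, Iterable

INFORMALIZATION_PROMPT = (
    "Translate the following formal {language} statement into a natural "
    "language description of the same mathematical statement:\n\n{statement}"
)

def build_informal_formal_pairs(
    formal_statements: Iterable[tuple[str, str]],   # (language, formal statement)
    call_llm: Callable[[str], str],                 # prompt -> model completion
) -> list[dict]:
    """Informalize each formal statement and keep the pair for fine-tuning."""
    pairs = []
    for language, statement in formal_statements:
        prompt = INFORMALIZATION_PROMPT.format(language=language, statement=statement)
        informal = call_llm(prompt)
        # The fine-tuning direction is informal -> formal (autoformalization).
        pairs.append({"input": informal, "target": statement, "language": language})
    return pairs

if __name__ == "__main__":
    # Toy demo with a mock "LLM" so the script runs without any API access.
    demo_statements = [
        ("Lean", "theorem add_comm' (a b : Nat) : a + b = b + a := by omega"),
        ("Isabelle", 'lemma add_commute: "a + b = b + (a::nat)"'),
    ]
    mock_llm = lambda prompt: (
        "The sum of two natural numbers does not depend on their order."
    )
    for pair in build_informal_formal_pairs(demo_statements, mock_llm):
        print(json.dumps(pair, indent=2))
```

Covering more than one formal language (here Lean and Isabelle entries in the same pool) mirrors the paper's central claim that multi-language training data yields stronger autoformalization models, even when evaluated on a single language.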
Cite
Text
Jiang et al. "Multi-Language Diversity Benefits Autoformalization." Neural Information Processing Systems, 2024. doi:10.52202/079017-2659
Markdown
[Jiang et al. "Multi-Language Diversity Benefits Autoformalization." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/jiang2024neurips-multilanguage/) doi:10.52202/079017-2659
BibTeX
@inproceedings{jiang2024neurips-multilanguage,
title = {{Multi-Language Diversity Benefits Autoformalization}},
author = {Jiang, Albert Q. and Li, Wenda and Jamnik, Mateja},
booktitle = {Neural Information Processing Systems},
year = {2024},
doi = {10.52202/079017-2659},
url = {https://mlanthology.org/neurips/2024/jiang2024neurips-multilanguage/}
}