MathPile: A Billion-Token-Scale Pretraining Corpus for Math
Abstract
High-quality, large-scale corpora are the cornerstone of building foundation models. In this work, we introduce MathPile, a diverse and high-quality math-centric corpus comprising about 9.5 billion tokens. Throughout its creation, we adhered to the principle of “less is more”, firmly believing in the supremacy of data quality over quantity, even in the pre-training phase. Our meticulous data collection and processing efforts included a complex suite of preprocessing, prefiltering, language identification, cleaning, filtering, and deduplication, ensuring the high quality of our corpus. Furthermore, we performed data contamination detection on downstream benchmark test sets to eliminate duplicates and conducted continual pre-training experiments, boosting the performance on common mathematical reasoning benchmarks. We aim for our MathPile to boost language models’ mathematical reasoning abilities and open-source its different versions and processing scripts to advance the field.
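To illustrate the kind of contamination check the abstract mentions, the sketch below flags corpus documents that share word-level n-grams with benchmark test examples. This is a minimal illustration, not MathPile's actual implementation: the function names, the n-gram size of 8, and the exact-match criterion are assumptions chosen for clarity.

```python
from typing import Iterable, Set


def ngrams(text: str, n: int = 8) -> Set[tuple]:
    """Return the set of word-level n-grams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def is_contaminated(document: str, benchmark_examples: Iterable[str],
                    n: int = 8) -> bool:
    """Flag a corpus document if it shares any n-gram with a benchmark test example."""
    doc_ngrams = ngrams(document, n)
    return any(doc_ngrams & ngrams(example, n) for example in benchmark_examples)


# Hypothetical usage: drop corpus documents overlapping a test question.
test_set = ["Prove that the sum of two even integers is even."]
corpus = ["Lecture notes: prove that the sum of two even integers is even. Proof: ..."]
clean_corpus = [doc for doc in corpus if not is_contaminated(doc, test_set)]
```

In practice, such checks are typically run alongside fuzzy matching or deduplication at scale; the exact-match filter above only shows the basic idea.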
Cite
Text
Wang et al. "MathPile: A Billion-Token-Scale Pretraining Corpus for Math." Neural Information Processing Systems, 2024. doi:10.52202/079017-0801
Markdown
[Wang et al. "MathPile: A Billion-Token-Scale Pretraining Corpus for Math." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/wang2024neurips-mathpile/) doi:10.52202/079017-0801
BibTeX
@inproceedings{wang2024neurips-mathpile,
title = {{MathPile: A Billion-Token-Scale Pretraining Corpus for Math}},
author = {Wang, Zengzhi and Li, Xuefeng and Xia, Rui and Liu, Pengfei},
booktitle = {Neural Information Processing Systems},
year = {2024},
doi = {10.52202/079017-0801},
url = {https://mlanthology.org/neurips/2024/wang2024neurips-mathpile/}
}