Autonomous Data Selection with Language Models for Mathematical Texts

Abstract

To improve language models’ proficiency in mathematical reasoning via continual pretraining, we introduce a novel strategy that leverages base language models for autonomous data selection. Departing from conventional supervised fine-tuning or classifiers trained on human-annotated data, our approach, Autonomous Data Selection (AutoDS), uses meta-prompted language models as zero-shot verifiers to evaluate and select high-quality mathematical content autonomously. To demonstrate the efficacy of our method, we continually pretrained a 7B-parameter language model on our curated dataset, achieving substantial improvements in downstream performance on the MATH, GSM8K, and BIG-Bench Hard (BBH) tasks while using orders of magnitude fewer tokens than previous continual-pretraining work. Our method delivers a 2-fold increase in pretraining token efficiency over state-of-the-art baselines, underscoring its potential for enhancing models’ mathematical reasoning capabilities. The AutoMathText dataset is available at https://huggingface.co/datasets/math-ai/AutoMathText.
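As a rough illustration of the core idea of using a meta-prompted base model as a zero-shot quality verifier, the sketch below scores a document by comparing the model's next-token preference for "YES" versus "NO" after a screening prompt, then keeps documents above a threshold. The model name, prompt wording, scoring formula, and threshold are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch of zero-shot data selection with a meta-prompted base LM.
# Model name, prompt text, scoring formula, and threshold are assumptions
# for illustration only, not the paper's exact configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "mistralai/Mistral-7B-v0.1"  # placeholder base model

META_PROMPT = (
    "You are helping curate a pretraining corpus for mathematics.\n"
    "Does the following text contain clear, high-quality mathematical content?\n\n"
    "<text>\n{document}\n</text>\n\n"
    "Answer with one word, YES or NO: "
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

def lm_quality_score(document: str) -> float:
    """Score in [0, 1]: the model's relative preference for 'YES' over 'NO'."""
    prompt = META_PROMPT.format(document=document[:4000])  # truncate long docs
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    yes_id = tokenizer.encode("YES", add_special_tokens=False)[0]
    no_id = tokenizer.encode("NO", add_special_tokens=False)[0]
    # Softmax restricted to the two answer tokens.
    probs = torch.softmax(next_token_logits[[yes_id, no_id]], dim=-1)
    return probs[0].item()

corpus = [
    "We prove that the sum of the first n odd numbers equals n^2 by induction.",
    "Click here to subscribe to our newsletter for daily deals!",
]
# Keep documents whose score clears an (assumed) threshold.
selected = [doc for doc in corpus if lm_quality_score(doc) > 0.6]
print(selected)
```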

Cite

Text

Zhang et al. "Autonomous Data Selection with Language Models for Mathematical Texts." ICLR 2024 Workshops: DPFM, 2024.

Markdown

[Zhang et al. "Autonomous Data Selection with Language Models for Mathematical Texts." ICLR 2024 Workshops: DPFM, 2024.](https://mlanthology.org/iclrw/2024/zhang2024iclrw-autonomous/)

BibTeX

@inproceedings{zhang2024iclrw-autonomous,
  title     = {{Autonomous Data Selection with Language Models for Mathematical Texts}},
  author    = {Zhang, Yifan and Luo, Yifan and Yuan, Yang and Yao, Andrew C.},
  booktitle = {ICLR 2024 Workshops: DPFM},
  year      = {2024},
  url       = {https://mlanthology.org/iclrw/2024/zhang2024iclrw-autonomous/}
}