Rewriting Pre-Training Data Boosts LLM Performance in Math and Code

Fujii, Kazuki; Tajima, Yukito; Mizuki, Sakae; Kawamura, Masaki; Shimada, Hinari; Shiotani, Taihei; Saito, Koshiro; Oi, Masanari; Nakamura, Taishi; Okamoto, Takumi; Ishida, Shigeki; Hattori, Kakeru; Ma, Youmi; Takamura, Hiroya; Yokota, Rio; Sakuma, Jun; Okazaki, Naoaki

ICLR 2026

Abstract

The performance of large language models (LLMs) in program synthesis and mathematical reasoning is fundamentally limited by the quality of their pre-training corpora. We introduce two openly licensed pre-training datasets, released under the Llama 3.3 Community License, that significantly enhance LLM performance by systematically rewriting public data. SwallowCode ($\approx$16.1 billion tokens) refines Python snippets from The-Stack-v2 through a novel four-stage pipeline: syntax validation, pylint-based style filtering, and a two-stage LLM rewriting process that enforces style conformity and transforms snippets into self-contained, algorithmically efficient examples. Unlike prior methods that rely on exclusionary filtering or limited transformations, our transform-and-retain approach refines low-quality code, maximizing data utility. SwallowMath ($\approx$2.3 billion tokens) enhances Finemath-4+ by removing boilerplate, restoring context, and reformatting solutions into concise, step-by-step explanations. Within a fixed 50 billion token training budget, continual pre-training of Llama-3.1-8B with SwallowCode boosts pass@1 by +17.0 on HumanEval and +16.1 on HumanEval+ compared to Stack-Edu, surpassing the baseline model's code generation capabilities. Similarly, substituting SwallowMath yields +12.4 accuracy on GSM8K and +7.6 on MATH. Ablation studies confirm that each pipeline stage contributes incrementally, with rewriting yielding the largest gains. By releasing datasets, prompts, checkpoints, and pipeline code, we ensure reproducibility and provide a transferable transform-and-retain methodology that can be adapted to other base models and LLM rewriting setups.
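The two filtering stages that precede LLM rewriting (syntax validation and lint-based filtering) can be illustrated with a minimal sketch. This is not the authors' released pipeline code: `is_valid_python`, `filter_snippets`, the `lint_score` callable, and the `min_score` threshold are hypothetical names and values introduced only for illustration. The scorer is passed in as a function, so a pylint-derived score could be plugged in without this sketch assuming anything about pylint's API.

```python
import ast
from typing import Callable, Iterable, List


def is_valid_python(source: str) -> bool:
    """Stage 1 (sketch): keep only snippets that parse as Python source."""
    try:
        ast.parse(source)
    except (SyntaxError, ValueError):
        # SyntaxError covers malformed code; ValueError covers inputs such as
        # source containing null bytes on some Python versions.
        return False
    return True


def filter_snippets(
    snippets: Iterable[str],
    lint_score: Callable[[str], float],
    min_score: float = 7.0,  # hypothetical threshold, not taken from the paper
) -> List[str]:
    """Stage 2 (sketch): drop snippets whose lint score is below a threshold.

    `lint_score` is an assumed stand-in for a pylint-based scorer.
    """
    kept: List[str] = []
    for source in snippets:
        if not is_valid_python(source):
            continue  # fails syntax validation
        if lint_score(source) < min_score:
            continue  # fails style filtering
        kept.append(source)
    return kept
```

Snippets that survive both checks would then be handed to the two LLM rewriting stages described in the abstract, which this sketch does not attempt to reproduce.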

Cite

Text

Fujii et al. "Rewriting Pre-Training Data Boosts LLM Performance in Math and Code." International Conference on Learning Representations, 2026.

Markdown

[Fujii et al. "Rewriting Pre-Training Data Boosts LLM Performance in Math and Code." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/fujii2026iclr-rewriting/)

BibTeX

@inproceedings{fujii2026iclr-rewriting,
  title     = {{Rewriting Pre-Training Data Boosts LLM Performance in Math and Code}},
  author    = {Fujii, Kazuki and Tajima, Yukito and Mizuki, Sakae and Kawamura, Masaki and Shimada, Hinari and Shiotani, Taihei and Saito, Koshiro and Oi, Masanari and Nakamura, Taishi and Okamoto, Takumi and Ishida, Shigeki and Hattori, Kakeru and Ma, Youmi and Takamura, Hiroya and Yokota, Rio and Sakuma, Jun and Okazaki, Naoaki},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/fujii2026iclr-rewriting/}
}