Synthetic Bootstrapped Pretraining

Yang, Zitong; Zhang, Aonan; Liu, Hong; Hashimoto, Tatsunori; Candes, Emmanuel; Wang, Chong; Pang, Ruoming

ICLR 2026

Abstract

We introduce Synthetic Bootstrapped Pretraining (SBP), a language model (LM) pretraining procedure that first learns a model of relations between documents from the pretraining dataset and then leverages it to synthesize a vast new corpus for joint training. While standard pretraining teaches LMs to learn causal correlations among tokens within a single document, it is not designed to efficiently model the rich, learnable inter-document correlations that can potentially lead to better performance. We validate SBP by designing a compute-matched pretraining setup and pretraining a 3B-parameter and a 6B-parameter model from scratch on up to 1T tokens. We find that SBP consistently improves upon a strong repetition baseline and delivers up to 60% of the performance improvement attainable by an oracle upper bound with access to 20x more unique data. Qualitative analysis reveals that the synthesized documents go beyond mere paraphrases: SBP first abstracts a core concept from the seed material and then crafts a new narration on top of it. Besides strong empirical performance, SBP admits a natural Bayesian interpretation: the synthesizer implicitly learns to abstract the latent concepts shared between related documents.
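
The overall data flow described in the abstract (relate documents, synthesize from seeds, train on the union) can be illustrated with a toy sketch. This is not the authors' implementation: the bag-of-words cosine similarity, the `threshold` value, and the names `pair_related` and `build_joint_corpus` are all assumptions made for illustration, and the synthesizer is left as a caller-supplied function where a real system would presumably use a conditional LM trained on the related pairs.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words counts; a toy stand-in (assumption) for a learned document embedding."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two token-count mappings."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def pair_related(docs, threshold=0.3):
    """Collect ordered (seed, related) pairs of distinct documents.

    The pairs are what a conditional synthesizer would be trained on.
    The similarity measure and threshold are illustrative assumptions.
    """
    vecs = [bow(d) for d in docs]
    return [
        (docs[i], docs[j])
        for i in range(len(docs))
        for j in range(len(docs))
        if i != j and cosine(vecs[i], vecs[j]) >= threshold
    ]


def build_joint_corpus(docs, synthesize):
    """Return the real documents followed by one synthetic document per seed.

    `synthesize` is a hypothetical callable mapping a seed document to new text.
    """
    return list(docs) + [synthesize(d) for d in docs]


docs = [
    "the cat sat on the mat",
    "the cat lay on the mat",
    "stock prices fell sharply today",
]
pairs = pair_related(docs)
corpus = build_joint_corpus(docs, lambda seed: "SYNTHETIC: " + seed)
```

Here `pairs` contains only the two cat documents (in both orders), since the third shares no tokens with them, and `corpus` holds the three real documents followed by three placeholder synthetic ones. The placeholder lambda only marks where generated text would go; it does not model the concept abstraction the abstract describes.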

Cite

Text

Yang et al. "Synthetic Bootstrapped Pretraining." International Conference on Learning Representations, 2026.

Markdown

[Yang et al. "Synthetic Bootstrapped Pretraining." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/yang2026iclr-synthetic/)

BibTeX

@inproceedings{yang2026iclr-synthetic,
  title     = {{Synthetic Bootstrapped Pretraining}},
  author    = {Yang, Zitong and Zhang, Aonan and Liu, Hong and Hashimoto, Tatsunori and Candes, Emmanuel and Wang, Chong and Pang, Ruoming},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/yang2026iclr-synthetic/}
}