Parameterized Synthetic Text Generation with SimpleStories

Abstract

We present SimpleStories, a large synthetic story dataset in simple language, consisting of 2 million samples each in English and Japanese. By parameterizing prompts at multiple levels of abstraction, we achieve control over story characteristics at scale, inducing syntactic and semantic diversity. Ablations on a newly trained suite of tiny models then show improved sample efficiency and model interpretability compared with the TinyStories dataset. We open-source all constituent parts of model creation, hoping to enable novel ways to study the end-to-end training process. As a byproduct, we advance the frontier for the fewest-parameter language model that outputs grammatical English.

Cite

Text

Finke et al. "Parameterized Synthetic Text Generation with SimpleStories." Advances in Neural Information Processing Systems, 2025.

Markdown

[Finke et al. "Parameterized Synthetic Text Generation with SimpleStories." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/finke2025neurips-parameterized/)

BibTeX

@inproceedings{finke2025neurips-parameterized,
  title     = {{Parameterized Synthetic Text Generation with SimpleStories}},
  author    = {Finke, Lennart and Sreedhara, Chandan and Dooms, Thomas and Allen, Mat and Rodriguez, Juan Diego and Nabeshima, Noa and Marshall, Thomas and Braun, Dan},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/finke2025neurips-parameterized/}
}