[Tiny] Parameterized Synthetic Text Generation with SimpleStories
Abstract
We present SimpleStories, a large synthetic story dataset in simple language, consisting of 2 million stories each in English and Japanese. Our method employs parametrization of prompts with features at multiple levels of abstraction, allowing for systematic control over story characteristics to ensure broad syntactic and semantic diversity. Building on and addressing limitations in the TinyStories dataset, our approach demonstrates that simplicity and variety can be achieved simultaneously in synthetic text generation at scale.
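The prompt parametrization described above can be illustrated with a minimal sketch. The abstract does not list the actual features or prompt wording, so the feature pools, the template text, and the function names `sample_features` and `build_prompt` below are hypothetical placeholders, not the authors' implementation. The sketch shows only the general idea: sample one value per feature, at several levels of abstraction, and substitute the values into a prompt template.

```python
import random

# Hypothetical feature pools at different levels of abstraction.
# The real SimpleStories features are not given in the abstract.
FEATURES = {
    "theme": ["friendship", "courage", "curiosity"],                 # high-level content
    "setting": ["a forest", "a small village", "a space station"],   # concrete content
    "style": ["playful", "calm", "mysterious"],                      # tone
    "grammar": ["past tense", "present tense", "dialogue-heavy"],    # surface form
}

# Hypothetical prompt template; the wording is an assumption.
TEMPLATE = (
    "Write a short story in simple language. "
    "Theme: {theme}. Setting: {setting}. Style: {style}. "
    "Grammar feature: {grammar}."
)


def sample_features(rng):
    """Draw one value per feature, independently and uniformly at random."""
    return {name: rng.choice(values) for name, values in FEATURES.items()}


def build_prompt(features):
    """Fill the template with the sampled feature values."""
    return TEMPLATE.format(**features)


# A seeded generator keeps the sampled prompts reproducible.
rng = random.Random(0)
prompts = [build_prompt(sample_features(rng)) for _ in range(3)]
for prompt in prompts:
    print(prompt)
```

Because each feature is sampled independently, the number of distinct prompts grows multiplicatively with the number of features, which is one plausible way to obtain variety while the instruction to use simple language stays fixed.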
Cite
Text
Finke et al. "[Tiny] Parameterized Synthetic Text Generation with SimpleStories." ICLR 2025 Workshops: SynthData, 2025.
Markdown
[Finke et al. "[Tiny] Parameterized Synthetic Text Generation with SimpleStories." ICLR 2025 Workshops: SynthData, 2025.](https://mlanthology.org/iclrw/2025/finke2025iclrw-tiny/)
BibTeX
@inproceedings{finke2025iclrw-tiny,
title = {{[Tiny] Parameterized Synthetic Text Generation with SimpleStories}},
author = {Finke, Lennart and Dooms, Thomas and Allen, Mat and Rodriguez, Juan Diego and Nabeshima, Noa and Braun, Dan},
booktitle = {ICLR 2025 Workshops: SynthData},
year = {2025},
url = {https://mlanthology.org/iclrw/2025/finke2025iclrw-tiny/}
}