WildChat-50M: A Deep Dive into the Role of Synthetic Data in Post-Training

Abstract

Large language model (LLM) post-training can refine behaviors and unlock new skills, but the open science supporting these post-training techniques is still in its infancy. One limiting factor has been the difficulty of conducting large-scale comparative analyses of synthetic-data-generating models and LLM judges. To close this gap, we introduce WildChat-50M, the largest public chat dataset to date. We extend the existing WildChat dataset to include responses not only from GPT but from over 50 different open-weight models, ranging in size from 0.5B to 104B parameters. We conduct an extensive comparative analysis and demonstrate the potential of this dataset by creating Re-Wild, our own public SFT mix, which outperforms the recent Tulu-3 SFT mixture from Allen AI with only 40% as many samples.
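For readers who want to explore the data, below is a minimal sketch of loading a chat dataset like this one with the Hugging Face datasets library. The repository ID and the "conversation" field name are illustrative assumptions, not details taken from the paper; substitute the actual repository ID published by the authors.

# Minimal sketch: streaming a WildChat-style chat dataset with the
# Hugging Face `datasets` library.
from datasets import load_dataset

# Hypothetical repository ID for illustration only.
REPO_ID = "example-org/wildchat-50m"

# Stream to avoid downloading the full dataset up front.
ds = load_dataset(REPO_ID, split="train", streaming=True)

# Inspect the first conversation; the field name assumes the common
# chat schema of a list of {"role", "content"} turns.
for row in ds:
    for turn in row.get("conversation", []):
        print(turn["role"], ":", turn["content"][:80])
    break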

Cite

Text

Feuer and Hegde. "WildChat-50M: A Deep Dive into the Role of Synthetic Data in Post-Training." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown

[Feuer and Hegde. "WildChat-50M: A Deep Dive into the Role of Synthetic Data in Post-Training." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/feuer2025icml-wildchat50m/)

BibTeX

@inproceedings{feuer2025icml-wildchat50m,
  title     = {{WildChat-50M: A Deep Dive into the Role of Synthetic Data in Post-Training}},
  author    = {Feuer, Benjamin and Hegde, Chinmay},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
  pages     = {17100--17130},
  volume    = {267},
  url       = {https://mlanthology.org/icml/2025/feuer2025icml-wildchat50m/}
}