Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources

Abstract

Synthetic data generation has recently emerged as a promising approach for enhancing the capabilities of large language models (LLMs) without the need for expensive human annotations. However, existing methods often generate data that can be low quality or contrived. In this paper, we introduce Source2Synth, a scalable approach for synthetic data generation and curation that is grounded in real-world data sources. Source2Synth takes as input a custom data source and produces synthetic data points with intermediate reasoning steps. Our method improves the dataset quality by discarding low-quality generations based on their answerability. We demonstrate the generality of this approach by applying it to two tasks that leverage two different types of sources: multi-hop question answering (MHQA), where we test complex reasoning abilities leveraging documents, and tabular question answering (TQA), where we test tool usage leveraging tables. Our method improves performance by 25.51% for TQA on WikiSQL and 22.57% for MHQA on HotpotQA compared to the fine-tuned baselines.
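
The abstract describes a two-stage pipeline: generating synthetic examples grounded in a real data source, then curating them by discarding generations that are not answerable. The sketch below illustrates only the curation idea under our own assumptions; the names (`SyntheticExample`, `curate`, `answer_fn`) are hypothetical and not taken from the authors' code.

```python
# Minimal sketch of a Source2Synth-style curation step: keep a synthetic
# example only if it is "answerable", i.e. a model prompted with the source
# context and the generated question reproduces the intended answer.
# All names below are illustrative assumptions, not the authors' code.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SyntheticExample:
    source: str     # real-world source the example is grounded in
    question: str   # generated question
    reasoning: str  # generated intermediate reasoning steps
    answer: str     # generated answer


def curate(
    examples: List[SyntheticExample],
    answer_fn: Callable[[str, str], str],  # e.g. an LLM call: (context, question) -> predicted answer
) -> List[SyntheticExample]:
    """Discard low-quality generations based on answerability."""
    kept = []
    for ex in examples:
        predicted = answer_fn(ex.source, ex.question)
        # Keep the example only if the predicted answer matches the intended
        # one (simple exact match as a proxy for answerability).
        if predicted.strip().lower() == ex.answer.strip().lower():
            kept.append(ex)
    return kept
```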

Cite

Text

Lupidi et al. "Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources." ICLR 2025 Workshops: SSI-FM, 2025.

Markdown

[Lupidi et al. "Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources." ICLR 2025 Workshops: SSI-FM, 2025.](https://mlanthology.org/iclrw/2025/lupidi2025iclrw-source2synth/)

BibTeX

@inproceedings{lupidi2025iclrw-source2synth,
  title     = {{Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources}},
  author    = {Lupidi, Alisia Maria and Gemmell, Carlos and Cancedda, Nicola and Yu, Jane and Weston, Jason E and Foerster, Jakob Nicolaus and Raileanu, Roberta and Lomeli, Maria},
  booktitle = {ICLR 2025 Workshops: SSI-FM},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/lupidi2025iclrw-source2synth/}
}