SynQL: Synthetic Data Generation for In-Domain, Low-Resource Text-to-SQL Parsing
Abstract
We address the challenge of generating high-quality data for text-to-SQL parsing in low-resource, in-domain scenarios. Although leveraging large language models (LLMs) and in-context learning often achieves the best results in research settings, it is frequently impractical for real-world applications. Fine-tuning smaller, domain-specific models therefore provides a viable alternative, but it is often constrained by the scarcity of training data. To overcome this, we introduce SynQL, a novel method for synthetic text-to-SQL data generation tailored to in-domain contexts. We demonstrate the effectiveness of SynQL on the KaggleDBQA benchmark, showing significant performance improvements over models fine-tuned on the original data. Additionally, we validate our method on the out-of-domain Spider dataset. We open-source the method and both synthetic datasets.
Cite
Text
Baumgartner and Kornuta. "SynQL: Synthetic Data Generation for In-Domain, Low-Resource Text-to-SQL Parsing." NeurIPS 2024 Workshops: TRL, 2024.
Markdown
[Baumgartner and Kornuta. "SynQL: Synthetic Data Generation for In-Domain, Low-Resource Text-to-SQL Parsing." NeurIPS 2024 Workshops: TRL, 2024.](https://mlanthology.org/neuripsw/2024/baumgartner2024neuripsw-synql/)
BibTeX
@inproceedings{baumgartner2024neuripsw-synql,
  title = {{SynQL: Synthetic Data Generation for In-Domain, Low-Resource Text-to-SQL Parsing}},
  author = {Baumgartner, Denver and Kornuta, Tomasz},
  booktitle = {NeurIPS 2024 Workshops: TRL},
  year = {2024},
  url = {https://mlanthology.org/neuripsw/2024/baumgartner2024neuripsw-synql/}
}