SynQL: Synthetic Data Generation for In-Domain, Low-Resource Text-to-SQL Parsing
Abstract
We address the challenge of generating high-quality data for text-to-SQL parsing in low-resource, in-domain scenarios. Although leveraging large language models (LLMs) and in-context learning often achieves the best results in research settings, it is frequently impractical for real-world applications. Fine-tuning smaller, domain-specific models therefore provides a viable alternative, but it is often constrained by the scarcity of training data. To overcome this, we introduce SynQL, a novel method for synthetic text-to-SQL data generation tailored to in-domain contexts. We demonstrate the effectiveness of SynQL on the KaggleDBQA benchmark, showing significant performance improvements over models fine-tuned on the original data. Additionally, we validate our method on the out-of-domain Spider dataset. We open-source the method and both synthetic datasets.
Cite
Text
Baumgartner and Kornuta. "SynQL: Synthetic Data Generation for In-Domain, Low-Resource Text-to-SQL Parsing." NeurIPS 2024 Workshops: TRL, 2024.
Markdown
[Baumgartner and Kornuta. "SynQL: Synthetic Data Generation for In-Domain, Low-Resource Text-to-SQL Parsing." NeurIPS 2024 Workshops: TRL, 2024.](https://mlanthology.org/neuripsw/2024/baumgartner2024neuripsw-synql/)
BibTeX
@inproceedings{baumgartner2024neuripsw-synql,
  title = {{SynQL: Synthetic Data Generation for In-Domain, Low-Resource Text-to-SQL Parsing}},
  author = {Baumgartner, Denver and Kornuta, Tomasz},
  booktitle = {NeurIPS 2024 Workshops: TRL},
  year = {2024},
  url = {https://mlanthology.org/neuripsw/2024/baumgartner2024neuripsw-synql/}
}