Importance of Synthesizing High-Quality Data for Text-to-SQL Parsing

Zhao, Yiyun; Jiang, Jiarong; Hu, Yiqun; Lan, Wuwei; Zhu, Henghui; Chauhan, Anuj; Li, Alexander Hanbo; Pan, Lin; Wang, Jun; Hang, Chung-Wei; Zhang, Sheng; Dong, Mingwen; Lilien, Joseph; Ng, Patrick; Wang, Zhiguo; Castelli, Vittorio; Xiang, Bing

NeurIPSW 2022

Abstract

There has been increasing interest in synthesizing data to improve downstream text-to-SQL tasks. In this paper, we examine existing synthesized datasets and find that state-of-the-art text-to-SQL algorithms do not improve further on popular benchmarks when trained with augmented synthetic data. We observe two shortcomings: illogical synthetic SQL queries that result from independent column sampling, and arbitrary table joins. To address these issues, we propose a novel synthesis framework that incorporates key relationships from the schema, imposes strong typing, and conducts schema-distance-weighted column sampling. We also adopt an intermediate representation (IR) for the SQL-to-text task to further improve the quality of the generated natural language questions. When existing powerful semantic parsers are pre-finetuned on our high-quality synthesized data, they show significant accuracy gains and achieve new state-of-the-art performance on Spider.
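To make the idea of schema-distance-weighted column sampling more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that schema distance can be approximated by the number of foreign-key hops between tables, and that a column's sampling weight decays exponentially with that distance. The toy schema, the `decay` parameter, and all function names are hypothetical and chosen only for illustration.

```python
import math
import random
from collections import deque

# Hypothetical toy schema: table name -> column names.
SCHEMA = {
    "singer": ["singer_id", "name", "age"],
    "singer_in_concert": ["singer_id", "concert_id"],
    "concert": ["concert_id", "stadium_id", "year"],
    "stadium": ["stadium_id", "capacity"],
}
# Hypothetical foreign-key links, treated as undirected edges between tables.
FOREIGN_KEYS = [
    ("singer_in_concert", "singer"),
    ("singer_in_concert", "concert"),
    ("concert", "stadium"),
]


def table_distances(start, foreign_keys):
    """Breadth-first search over the foreign-key graph: hop counts from start."""
    neighbors = {}
    for a, b in foreign_keys:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        table = queue.popleft()
        for nxt in neighbors.get(table, ()):
            if nxt not in dist:
                dist[nxt] = dist[table] + 1
                queue.append(nxt)
    return dist


def column_weights(anchor_table, schema, foreign_keys, decay=1.0):
    """Weight each column by exp(-decay * hops) from the anchor table.

    The exponential form is an assumption made for this sketch.
    """
    dist = table_distances(anchor_table, foreign_keys)
    weights = {}
    for table, columns in schema.items():
        if table not in dist:
            # No key path to the anchor: skip, so no arbitrary join is implied.
            continue
        for column in columns:
            weights[(table, column)] = math.exp(-decay * dist[table])
    return weights


def sample_column(anchor_table, schema, foreign_keys, rng, decay=1.0):
    """Draw one (table, column) pair, favoring tables close to the anchor."""
    weights = column_weights(anchor_table, schema, foreign_keys, decay)
    items = list(weights)
    return rng.choices(items, weights=[weights[i] for i in items], k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_column("singer", SCHEMA, FOREIGN_KEYS, rng))
```

Under these assumptions, columns in the anchor table are the most likely to be drawn and columns several key hops away are drawn rarely, which is one way to avoid combining unrelated columns in a single synthetic query.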

Cite

Text

Zhao et al. "Importance of Synthesizing High-Quality Data for Text-to-SQL Parsing." NeurIPS 2022 Workshops: SyntheticData4ML, 2022.

Markdown

[Zhao et al. "Importance of Synthesizing High-Quality Data for Text-to-SQL Parsing." NeurIPS 2022 Workshops: SyntheticData4ML, 2022.](https://mlanthology.org/neuripsw/2022/zhao2022neuripsw-importance/)

BibTeX

@inproceedings{zhao2022neuripsw-importance,
  title     = {{Importance of Synthesizing High-Quality Data for Text-to-SQL Parsing}},
  author    = {Zhao, Yiyun and Jiang, Jiarong and Hu, Yiqun and Lan, Wuwei and Zhu, Henghui and Chauhan, Anuj and Li, Alexander Hanbo and Pan, Lin and Wang, Jun and Hang, Chung-Wei and Zhang, Sheng and Dong, Mingwen and Lilien, Joseph and Ng, Patrick and Wang, Zhiguo and Castelli, Vittorio and Xiang, Bing},
  booktitle = {NeurIPS 2022 Workshops: SyntheticData4ML},
  year      = {2022},
  url       = {https://mlanthology.org/neuripsw/2022/zhao2022neuripsw-importance/}
}