Scaling Knowledge Graph Construction Through Synthetic Data Generation and Distillation
Abstract
Document-level knowledge graph (KG) construction faces a fundamental scaling challenge: existing methods either rely on expensive large language models (LLMs), making them economically nonviable for large-scale corpora, or employ smaller models that produce incomplete and inconsistent graphs. We find that this limitation stems not from model capabilities but from insufficient training on high-quality document-level KG data. To address this gap, we introduce SynthKG, a multi-step data synthesis pipeline that generates high-quality document-KG pairs through systematic chunking, decontextualization, and structured extraction using LLMs. By fine-tuning a smaller LLM on the synthesized document-KG pairs, we streamline the multi-step process into a single-step KG generation approach called Distill-SynthKG. Furthermore, we repurpose existing question-answering datasets to construct KG evaluation datasets and introduce new evaluation metrics. Using KGs produced by Distill-SynthKG, we also design a novel graph-based retrieval framework for retrieval-augmented generation (RAG). Experimental results demonstrate that Distill-SynthKG not only surpasses all baseline models in KG quality (including models up to eight times larger) but also consistently improves performance on retrieval and question-answering tasks. Additionally, our proposed graph retrieval framework outperforms all KG-retrieval methods across multiple benchmark datasets.
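The three stages named in the abstract (chunking, decontextualization, structured extraction) can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the authors' implementation: the function names `chunk_document` and `synthesize_kg`, the prompt strings, the `head | relation | tail` output format, and the stand-in `fake_llm` are all hypothetical and do not come from the paper. Any real LLM client could be passed in as the `llm` callable.

```python
from typing import Callable, List, Tuple

Triplet = Tuple[str, str, str]


def chunk_document(text: str, max_words: int = 50) -> List[str]:
    """Pack blank-line-separated paragraphs into chunks of at most max_words words.

    A single paragraph longer than max_words is kept whole as its own chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    count = 0
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        n = len(para.split())
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def synthesize_kg(text: str, llm: Callable[[str], str], max_words: int = 50) -> List[Triplet]:
    """Chunk, decontextualize, and extract triplets, deduplicating across chunks."""
    triplets: List[Triplet] = []
    seen = set()
    previous = ""
    for chunk in chunk_document(text, max_words):
        # Hypothetical prompt: rewrite the chunk so it stands alone,
        # using the previous (already rewritten) chunk as context.
        standalone = llm(f"DECONTEXTUALIZE\nPrevious:\n{previous}\nChunk:\n{chunk}")
        # Hypothetical prompt: return one 'head | relation | tail' per line.
        raw = llm(f"EXTRACT\n{standalone}")
        for line in raw.splitlines():
            parts = [p.strip() for p in line.split("|")]
            if len(parts) == 3 and all(parts):
                triplet = (parts[0], parts[1], parts[2])
                if triplet not in seen:
                    seen.add(triplet)
                    triplets.append(triplet)
        previous = standalone
    return triplets


def fake_llm(prompt: str) -> str:
    """Deterministic stand-in for an LLM call, used only to make the sketch runnable."""
    if prompt.startswith("DECONTEXTUALIZE"):
        chunk = prompt.split("Chunk:\n", 1)[1]
        return chunk.replace("It ", "SynthKG ")
    return "SynthKG | generates | document-KG pairs"


if __name__ == "__main__":
    doc = "SynthKG is a pipeline.\n\nIt generates document-KG pairs."
    print(synthesize_kg(doc, fake_llm, max_words=4))
```

In this picture, the distillation step described in the abstract would replace the per-chunk loop with a single call to a fine-tuned smaller model that maps the whole document to its triplets. The sketch does not attempt to show that step.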
Cite
Text
Choubey et al. "Scaling Knowledge Graph Construction Through Synthetic Data Generation and Distillation." International Conference on Learning Representations, 2026.
Markdown
[Choubey et al. "Scaling Knowledge Graph Construction Through Synthetic Data Generation and Distillation." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/choubey2026iclr-scaling/)
BibTeX
@inproceedings{choubey2026iclr-scaling,
title = {{Scaling Knowledge Graph Construction Through Synthetic Data Generation and Distillation}},
author = {Choubey, Prafulla Kumar and Su, Xin and Luo, Man and Peng, Xiangyu and Xiong, Caiming and Le, Tiep and Rosenman, Shachar and Lal, Vasudev and Mui, Phil L and Ho, Ricky and Howard, Phillip and Wu, Chien-Sheng},
booktitle = {International Conference on Learning Representations},
year = {2026},
url = {https://mlanthology.org/iclr/2026/choubey2026iclr-scaling/}
}