Scaling Knowledge Graph Construction Through Synthetic Data Generation and Distillation

Abstract

Document-level knowledge graph (KG) construction faces a fundamental scaling challenge: existing methods either rely on expensive large language models (LLMs), making them economically nonviable for large-scale corpora, or employ smaller models that produce incomplete and inconsistent graphs. We find that this limitation stems not from model capabilities but from insufficient training on high-quality document-level KG data. To address this gap, we introduce SynthKG, a multi-step data synthesis pipeline that generates high-quality document-KG pairs through systematic chunking, decontextualization, and structured extraction using LLMs. By fine-tuning a smaller LLM on synthesized document-KG pairs, we streamline the multi-step process into a single-step KG generation approach called Distill-SynthKG. Furthermore, we repurpose existing question-answering datasets to construct KG evaluation datasets and introduce new evaluation metrics. Using KGs produced by Distill-SynthKG, we also design a novel graph-based retrieval framework for RAG. Experimental results demonstrate that Distill-SynthKG not only surpasses all baseline models in KG quality (including models up to eight times larger) but also consistently improves performance on retrieval and question-answering tasks. Additionally, our proposed graph retrieval framework outperforms all KG-retrieval methods across multiple benchmark datasets.
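The three stages named in the abstract (chunking, decontextualization, structured extraction) can be pictured with a minimal sketch. Everything in it is an illustrative assumption rather than the authors' implementation: the function names `chunk_document` and `synth_kg`, the sentence-based chunking rule, and the two `toy_*` callables, which stand in for the LLM calls the pipeline would make. The abstract does not specify prompts, chunk sizes, or output schema.

```python
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)


def chunk_document(text: str, max_sentences: int = 2) -> List[str]:
    """Split a document into chunks of at most `max_sentences` sentences.

    Assumption: naive splitting on periods; a real pipeline would use a
    proper sentence or token-based splitter.
    """
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return [
        ". ".join(sentences[i:i + max_sentences]) + "."
        for i in range(0, len(sentences), max_sentences)
    ]


def synth_kg(
    text: str,
    decontextualize: Callable[[str, str], str],
    extract: Callable[[str], List[Triple]],
    max_sentences: int = 2,
) -> List[Triple]:
    """Run chunking, decontextualization, and extraction in sequence.

    `decontextualize(chunk, previous)` and `extract(chunk)` are hypothetical
    hooks where LLM calls would go.
    """
    triples: List[Triple] = []
    previous = ""
    for chunk in chunk_document(text, max_sentences):
        # Rewrite the chunk so it is understandable without earlier context.
        rewritten = decontextualize(chunk, previous)
        # Collect triples, skipping exact duplicates across chunks.
        for triple in extract(rewritten):
            if triple not in triples:
                triples.append(triple)
        previous = rewritten
    return triples


def toy_decontextualize(chunk: str, previous: str) -> str:
    # Stand-in for an LLM call: resolves a single hard-coded pronoun.
    return chunk.replace("She ", "Marie Curie ")


def toy_extract(chunk: str) -> List[Triple]:
    # Stand-in for an LLM call: reads "<head> won <tail>" as one triple.
    found: List[Triple] = []
    for sentence in chunk.split("."):
        if " won " in sentence:
            head, tail = sentence.split(" won ", 1)
            found.append((head.strip(), "won", tail.strip()))
    return found


document = "Marie Curie was a physicist. She won the Nobel Prize."
kg = synth_kg(document, toy_decontextualize, toy_extract, max_sentences=1)
# A (document, kg) pair like this is the kind of example a smaller model
# could be fine-tuned on to produce the KG in a single step.
training_pair = {"document": document, "kg": kg}
```

In this toy run the second chunk, "She won the Nobel Prize.", only yields a usable triple because the decontextualization step first restores the entity name, which is the motivation for placing that step before extraction.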

Cite

Text

Choubey et al. "Scaling Knowledge Graph Construction Through Synthetic Data Generation and Distillation." International Conference on Learning Representations, 2026.

Markdown

[Choubey et al. "Scaling Knowledge Graph Construction Through Synthetic Data Generation and Distillation." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/choubey2026iclr-scaling/)

BibTeX

@inproceedings{choubey2026iclr-scaling,
  title     = {{Scaling Knowledge Graph Construction Through Synthetic Data Generation and Distillation}},
  author    = {Choubey, Prafulla Kumar and Su, Xin and Luo, Man and Peng, Xiangyu and Xiong, Caiming and Le, Tiep and Rosenman, Shachar and Lal, Vasudev and Mui, Phil L and Ho, Ricky and Howard, Phillip and Wu, Chien-Sheng},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/choubey2026iclr-scaling/}
}