Diagnosing and Addressing Pitfalls in KG-RAG Datasets: Toward More Reliable Benchmarking

Zhang, Liangliang; Jiang, Zhuorui; Chi, Hongliang; Chen, Haoyang; ElKoumy, Mohammed; Wang, Fali; Wu, Qiong; Zhou, Zhengyi; Pan, Shirui; Wang, Suhang; Ma, Yao

NeurIPS 2025

Abstract

Knowledge Graph Question Answering (KGQA) systems rely on high-quality benchmarks to evaluate complex multi-hop reasoning. However, despite their widespread use, popular datasets such as WebQSP and CWQ suffer from critical quality issues, including inaccurate or incomplete ground-truth annotations, poorly constructed questions that are ambiguous, trivial, or unanswerable, and outdated or inconsistent knowledge. Through a manual audit of 16 popular KGQA datasets—including WebQSP and CWQ—we find that the average factual correctness rate is only 57%. To address these issues, we introduce KGQAGen, an LLM-in-the-loop framework that systematically resolves these pitfalls. KGQAGen combines structured knowledge grounding, LLM-guided generation, and symbolic verification to produce challenging and verifiable QA instances. Using KGQAGen, we construct KGQAGen-10k, a 10K-scale benchmark grounded in Wikidata, and evaluate a diverse set of KG-RAG models. Experimental results demonstrate that even state-of-the-art systems struggle on this benchmark, highlighting its ability to expose limitations of existing models. Our findings advocate for more rigorous benchmark construction and position KGQAGen as a scalable framework for advancing KGQA evaluation.

Cite

Text

Zhang et al. "Diagnosing and Addressing Pitfalls in KG-RAG Datasets: Toward More Reliable Benchmarking." Advances in Neural Information Processing Systems, 2025.

Markdown

[Zhang et al. "Diagnosing and Addressing Pitfalls in KG-RAG Datasets: Toward More Reliable Benchmarking." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/zhang2025neurips-diagnosing/)

BibTeX

@inproceedings{zhang2025neurips-diagnosing,
  title     = {{Diagnosing and Addressing Pitfalls in KG-RAG Datasets: Toward More Reliable Benchmarking}},
  author    = {Zhang, Liangliang and Jiang, Zhuorui and Chi, Hongliang and Chen, Haoyang and ElKoumy, Mohammed and Wang, Fali and Wu, Qiong and Zhou, Zhengyi and Pan, Shirui and Wang, Suhang and Ma, Yao},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/zhang2025neurips-diagnosing/}
}