Semi-Automated Construction of Complex Knowledge Base Question Answering Dataset Using Large Language Model

Abstract

Constructing a knowledge base question answering (KBQA) dataset is time-consuming, especially when ground-truth logical forms such as SPARQL queries are included. In this paper, we present a framework that leverages a Large Language Model (LLM), such as GPT-3.5, to semi-automatically construct a KBQA dataset, namely Movie Complex Question Answering (MCQA). During dataset construction, the LLM assists in generating question types, question templates, and SPARQL templates, which are then instantiated into question samples using entities drawn from a knowledge graph. To facilitate data construction and use of the MCQA dataset, we curate iMKG, a comprehensive knowledge graph for the movie domain, based on Wikidata and MovieKG. MCQA contains complex questions with ground-truth answers and SPARQL queries automatically sampled from iMKG. Experimental results from evaluating two state-of-the-art KBQA methods on our new dataset show that MCQA is a challenging yet promising KBQA benchmark with the potential to stimulate the development of more sophisticated KBQA methods. The MCQA dataset, including iMKG, can be downloaded from GitHub.
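
The template instantiation step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy triples, the `ex:` prefix, the template wording, and the `instantiate` helper are all hypothetical, and the real pipeline samples from iMKG rather than an in-memory list. The sketch only shows the general idea of filling a paired question template and SPARQL template with an entity sampled from a knowledge graph, and reading the ground-truth answers from the same graph.

```python
import random
from string import Template

# Toy knowledge graph as (subject, predicate, object) triples.
# Hypothetical stand-in for iMKG, for illustration only.
TOY_KG = [
    ("Inception", "director", "Christopher Nolan"),
    ("Interstellar", "director", "Christopher Nolan"),
    ("Titanic", "director", "James Cameron"),
]

# Paired templates sharing one placeholder; in the paper such templates
# are drafted with LLM assistance. The wording here is assumed.
QUESTION_TEMPLATE = Template("Which movies were directed by $director?")
SPARQL_TEMPLATE = Template(
    "PREFIX ex: <http://example.org/> "
    'SELECT ?movie WHERE { ?movie ex:director "$director" . }'
)


def instantiate(kg, rng):
    """Sample one director from the graph and fill both templates."""
    directors = sorted({o for _, p, o in kg if p == "director"})
    director = rng.choice(directors)
    # Ground-truth answers are read directly from the graph.
    answers = sorted(s for s, p, o in kg if p == "director" and o == director)
    return {
        "question": QUESTION_TEMPLATE.substitute(director=director),
        "sparql": SPARQL_TEMPLATE.substitute(director=director),
        "answers": answers,
    }


sample = instantiate(TOY_KG, random.Random(0))
print(sample["question"])
print(sample["sparql"])
print(sample["answers"])
```

Because the question, the query, and the answers are all derived from the same sampled entity, each generated sample stays consistent by construction, which is the property that makes template-based generation attractive for building ground-truth logical forms.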

Cite

Text

Hoang et al. "Semi-Automated Construction of Complex Knowledge Base Question Answering Dataset Using Large Language Model." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2024. doi:10.1007/978-3-031-70362-1_14

Markdown

[Hoang et al. "Semi-Automated Construction of Complex Knowledge Base Question Answering Dataset Using Large Language Model." European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2024.](https://mlanthology.org/ecmlpkdd/2024/hoang2024ecmlpkdd-semiautomated/) doi:10.1007/978-3-031-70362-1_14

BibTeX

@inproceedings{hoang2024ecmlpkdd-semiautomated,
  title     = {{Semi-Automated Construction of Complex Knowledge Base Question Answering Dataset Using Large Language Model}},
  author    = {Hoang, Lily and Liausvia, Fiona and Liu, Yan and Nguyen, Thanh-Son},
  booktitle = {European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases},
  year      = {2024},
  pages     = {230-248},
  doi       = {10.1007/978-3-031-70362-1_14},
  url       = {https://mlanthology.org/ecmlpkdd/2024/hoang2024ecmlpkdd-semiautomated/}
}