Statistical Knowledge Assessment for Large Language Models
Abstract
Given varying prompts regarding a factoid question, can a large language model (LLM) reliably generate factually correct answers? Existing LLMs may generate distinct responses for different prompts. In this paper, we study the problem of quantifying the knowledge contained in an LLM regarding a given set of facts. We propose KaRR, a statistical approach to assessing factual knowledge in LLMs. The main idea is to estimate the ratio of the LLM generating text corresponding to the answer entity, given diverse prompts of the subject and the querying relation, versus generating it by random chance. Our assessment suite contains a comprehensive set of 994,123 entities and 600 relations, with 1,395,905 text aliases. We use our method to evaluate 20 LLMs of various sizes, including LLaMA, Alpaca, OPT, etc. Experiments show that our results have a strong correlation (0.43 Kendall's $\tau$) with the results of human assessment on LLMs. Our results reveal that the knowledge in LLMs with the same backbone architecture adheres to the scaling law, while tuning on instruction-following data sometimes compromises the model's ability to reliably generate factually correct text.
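The ratio idea behind KaRR can be illustrated with a minimal sketch: score how much more likely the model is to emit the answer text when the subject appears in the prompt than when it is absent. The sketch below assumes a Hugging Face causal LM; the model choice, the toy alias lists, and the helper sequence_log_prob are illustrative simplifications, not the paper's released estimator, which aggregates over the full 1.4M-alias suite and latent text variables.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sequence_log_prob(prompt: str, continuation: str) -> float:
    """Sum of token log-probabilities of `continuation` given `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    cont_ids = tokenizer(continuation, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, cont_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    log_probs = torch.log_softmax(logits, dim=-1)
    offset = prompt_ids.shape[1]
    total = 0.0
    for i in range(cont_ids.shape[1]):
        # logits at position (offset + i - 1) predict the token at (offset + i)
        total += log_probs[0, offset + i - 1, cont_ids[0, i]].item()
    return total

# Toy prompt aliases for the fact (France, capital-of, Paris); the real suite
# draws many aliases of both the subject and the querying relation.
with_subject = ["The capital of France is", "France's capital city is"]
without_subject = ["The capital is", "The capital city is"]
answer = " Paris"

p_cond = sum(sequence_log_prob(p, answer) for p in with_subject) / len(with_subject)
p_base = sum(sequence_log_prob(p, answer) for p in without_subject) / len(without_subject)

# Ratio-style score: exponentiating the gap in average log-probability gives a
# (geometric-mean) ratio of answer likelihood with vs. without the subject.
score = torch.exp(torch.tensor(p_cond - p_base)).item()
print(f"ratio-style knowledge score: {score:.2f}")

A score well above 1 suggests the model's probability of producing the answer is driven by the subject rather than by chance; averaging such scores over many facts yields an aggregate knowledge measure in the spirit of KaRR.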
Cite
Text
Dong et al. "Statistical Knowledge Assessment for Large Language Models." Neural Information Processing Systems, 2023.
Markdown
[Dong et al. "Statistical Knowledge Assessment for Large Language Models." Neural Information Processing Systems, 2023.](https://mlanthology.org/neurips/2023/dong2023neurips-statistical/)
BibTeX
@inproceedings{dong2023neurips-statistical,
  title = {{Statistical Knowledge Assessment for Large Language Models}},
  author = {Dong, Qingxiu and Xu, Jingjing and Kong, Lingpeng and Sui, Zhifang and Li, Lei},
  booktitle = {Neural Information Processing Systems},
  year = {2023},
  url = {https://mlanthology.org/neurips/2023/dong2023neurips-statistical/}
}