HaluEval-Wild: Evaluating Hallucinations of Language Models in the Wild

Abstract

Hallucinations pose a significant challenge to the reliability of large language models (LLMs) in critical domains. Recent benchmarks designed to assess LLM hallucinations within conventional NLP tasks, such as knowledge-intensive question answering (QA) and summarization, are insufficient for capturing the complexities of user-LLM interactions in dynamic, real-world settings. To address this gap, we introduce HaluEval-Wild, the first benchmark specifically designed to evaluate LLM hallucinations in the wild. We meticulously collect challenging user queries (adversarially filtered by Alpaca) from ShareGPT, an existing real-world user-LLM interaction dataset, to evaluate the hallucination rates of various LLMs. Upon analyzing the collected queries, we categorize them into five distinct types, enabling a fine-grained analysis of the kinds of hallucinations LLMs exhibit, and we synthesize reference answers with the powerful GPT-4 model and retrieval-augmented generation (RAG). Our benchmark offers a novel approach to understanding and improving LLM reliability in scenarios that reflect real-world interactions.
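The collection and evaluation steps described in the abstract can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' released code: the helper names (answer_with_alpaca, is_hallucinated, model_answer) and the judging procedure are hypothetical stand-ins.

# Hypothetical sketch of adversarial filtering: keep only ShareGPT queries
# on which a weaker filter model (Alpaca) produces a hallucinated answer,
# then measure how often a target LLM hallucinates on the retained queries.
from typing import Callable, Iterable, List

def adversarially_filter(
    queries: Iterable[str],
    answer_with_alpaca: Callable[[str], str],     # assumed wrapper around an Alpaca checkpoint
    is_hallucinated: Callable[[str, str], bool],  # assumed (query, answer) -> bool judge
) -> List[str]:
    """Return the subset of user queries that the filter model answers with hallucinations."""
    challenging: List[str] = []
    for query in queries:
        answer = answer_with_alpaca(query)
        if is_hallucinated(query, answer):
            challenging.append(query)
    return challenging

def hallucination_rate(
    model_answer: Callable[[str], str],           # assumed wrapper around the LLM under evaluation
    queries: List[str],
    is_hallucinated: Callable[[str, str], bool],
) -> float:
    """Fraction of benchmark queries on which the target LLM hallucinates."""
    hallucinated = sum(is_hallucinated(q, model_answer(q)) for q in queries)
    return hallucinated / max(len(queries), 1)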

Cite

Text

Zhu et al. "HaluEval-Wild: Evaluating Hallucinations of Language Models in the Wild." ICLR 2025 Workshops: BuildingTrust, 2025.

Markdown

[Zhu et al. "HaluEval-Wild: Evaluating Hallucinations of Language Models in the Wild." ICLR 2025 Workshops: BuildingTrust, 2025.](https://mlanthology.org/iclrw/2025/zhu2025iclrw-haluevalwild/)

BibTeX

@inproceedings{zhu2025iclrw-haluevalwild,
  title     = {{HaluEval-Wild: Evaluating Hallucinations of Language Models in the Wild}},
  author    = {Zhu, Zhiying and Yang, Yiming and Sun, Zhiqing},
  booktitle = {ICLR 2025 Workshops: BuildingTrust},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/zhu2025iclrw-haluevalwild/}
}