HaluEval-Wild: Evaluating Hallucinations of Language Models in the Wild
Abstract
Hallucinations pose a significant challenge to the reliability of large language models (LLMs) in critical domains. Recent benchmarks designed to assess LLM hallucinations within conventional NLP tasks, such as knowledge-intensive question answering (QA) and summarization, are insufficient for capturing the complexities of user-LLM interactions in dynamic, real-world settings. To address this gap, we introduce HaluEval-Wild, the first benchmark specifically designed to evaluate LLM hallucinations in the wild. We meticulously collect challenging user queries (adversarially filtered by Alpaca) from ShareGPT, an existing real-world user-LLM interaction dataset, to evaluate the hallucination rates of various LLMs. Upon analyzing the collected queries, we categorize them into five distinct types, which enables a fine-grained analysis of the types of hallucinations LLMs exhibit, and synthesize reference answers with the powerful GPT-4 model and retrieval-augmented generation (RAG). Our benchmark offers a novel approach towards understanding and improving LLM reliability in scenarios reflective of real-world interactions.
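A minimal sketch of the two steps the abstract describes: adversarially filtering user queries (keeping only those on which a weaker model such as Alpaca hallucinates) and then measuring a hallucination rate for an evaluated model. This is not the authors' released code; the function names and judge interface below are hypothetical placeholders for an Alpaca inference wrapper and a GPT-4/RAG-based judge.

```python
from typing import Callable, List

# weak_model: hypothetical wrapper around Alpaca inference, query -> answer
# is_hallucinated: hypothetical judge (e.g., GPT-4 with RAG-synthesized references),
#                  (query, answer) -> True if the answer is judged a hallucination


def adversarially_filter(
    queries: List[str],
    weak_model: Callable[[str], str],
    is_hallucinated: Callable[[str, str], bool],
) -> List[str]:
    """Keep only the queries that elicit hallucinations from the weak model."""
    challenging = []
    for q in queries:
        answer = weak_model(q)
        if is_hallucinated(q, answer):
            challenging.append(q)
    return challenging


def hallucination_rate(
    queries: List[str],
    model: Callable[[str], str],
    is_hallucinated: Callable[[str, str], bool],
) -> float:
    """Fraction of challenging queries for which the evaluated model hallucinates."""
    if not queries:
        return 0.0
    flags = [is_hallucinated(q, model(q)) for q in queries]
    return sum(flags) / len(flags)
```

Per-category rates follow the same pattern, computed separately over each of the five query types.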
Cite
Text
Zhu et al. "HaluEval-Wild: Evaluating Hallucinations of Language Models in the Wild." ICLR 2025 Workshops: BuildingTrust, 2025.
Markdown
[Zhu et al. "HaluEval-Wild: Evaluating Hallucinations of Language Models in the Wild." ICLR 2025 Workshops: BuildingTrust, 2025.](https://mlanthology.org/iclrw/2025/zhu2025iclrw-haluevalwild/)
BibTeX
@inproceedings{zhu2025iclrw-haluevalwild,
title = {{HaluEval-Wild: Evaluating Hallucinations of Language Models in the Wild}},
author = {Zhu, Zhiying and Yang, Yiming and Sun, Zhiqing},
booktitle = {ICLR 2025 Workshops: BuildingTrust},
year = {2025},
url = {https://mlanthology.org/iclrw/2025/zhu2025iclrw-haluevalwild/}
}