THRONE: An Object-Based Hallucination Benchmark for the Free-Form Generations of Large Vision-Language Models

Abstract

Mitigating hallucinations in large vision-language models (LVLMs) remains an open problem. Recent benchmarks do not address hallucinations in open-ended free-form responses, which we term "Type I hallucinations". Instead, they focus on hallucinations in responses to very specific question formats (typically a multiple-choice response regarding a particular object or attribute), which we term "Type II hallucinations". Additionally, such benchmarks often require external API calls to models that are subject to change. In practice, we observe that a reduction in Type II hallucinations does not lead to a reduction in Type I hallucinations; rather, the two forms of hallucination are often anti-correlated. To address this, we propose THRONE, a novel object-based automatic framework for quantitatively evaluating Type I hallucinations in LVLM free-form outputs. We use public language models (LMs) to identify hallucinations in LVLM responses and compute informative metrics. By evaluating a large selection of recent LVLMs using public datasets, we show that an improvement in existing metrics does not lead to a reduction in Type I hallucinations, and that established benchmarks for measuring Type I hallucinations are incomplete. Finally, we provide a simple and effective data augmentation method to reduce Type I and Type II hallucinations as a strong baseline.

Cite

Text

Kaul et al. "THRONE: An Object-Based Hallucination Benchmark for the Free-Form Generations of Large Vision-Language Models." Conference on Computer Vision and Pattern Recognition, 2024. doi:10.1109/CVPR52733.2024.02571

Markdown

[Kaul et al. "THRONE: An Object-Based Hallucination Benchmark for the Free-Form Generations of Large Vision-Language Models." Conference on Computer Vision and Pattern Recognition, 2024.](https://mlanthology.org/cvpr/2024/kaul2024cvpr-throne/) doi:10.1109/CVPR52733.2024.02571

BibTeX

@inproceedings{kaul2024cvpr-throne,
  title     = {{THRONE: An Object-Based Hallucination Benchmark for the Free-Form Generations of Large Vision-Language Models}},
  author    = {Kaul, Prannay and Li, Zhizhong and Yang, Hao and Dukler, Yonatan and Swaminathan, Ashwin and Taylor, C. J. and Soatto, Stefano},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2024},
  pages     = {27228-27238},
  doi       = {10.1109/CVPR52733.2024.02571},
  url       = {https://mlanthology.org/cvpr/2024/kaul2024cvpr-throne/}
}