MFC-Bench: Benchmarking Multimodal Fact-Checking with Large Vision-Language Models

Abstract

Large vision-language models (LVLMs) have significantly improved performance on multimodal reasoning tasks such as visual question answering and image captioning. These models embed multimodal facts within their parameters rather than relying on external knowledge bases to store factual information explicitly. However, the content produced by LVLMs may deviate from factuality due to inherent bias or incorrect inference. In this work, we introduce MFC-Bench, a rigorous and comprehensive benchmark designed to evaluate the factual accuracy of LVLMs across three stages of verdict prediction for multimodal fact-checking (MFC): Manipulation, Out-of-Context, and Veracity Classification. Evaluating a dozen diverse and representative LVLMs on MFC-Bench, we find that current models still fall short in MFC and remain insensitive to various forms of manipulated content. We hope that MFC-Bench will draw attention to trustworthy AI, which LVLMs could help support in the future.
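
As a rough illustration of the three-stage verdict-prediction setup described in the abstract, the sketch below shows how a single image-claim pair might be classified on each task. It is a minimal sketch under stated assumptions, not the benchmark's actual interface: the query_lvlm placeholder, the prompt wording, and the label sets are introduced here purely for illustration.

# Minimal sketch of verdict prediction for multimodal fact-checking.
# The query_lvlm placeholder, prompts, and label sets are hypothetical
# and not taken from MFC-Bench itself.
from dataclasses import dataclass

# Assumed label spaces for the three stages named in the abstract.
TASKS = {
    "manipulation": ["manipulated", "original"],
    "out_of_context": ["out-of-context", "in-context"],
    "veracity": ["true", "false", "not enough information"],
}

@dataclass
class Sample:
    image_path: str   # image under verification
    claim: str        # textual claim paired with the image

def query_lvlm(image_path: str, prompt: str) -> str:
    """Placeholder for a call to any LVLM client.

    A real implementation would load the image, send it together with
    the prompt, and return the model's text response.
    """
    raise NotImplementedError("plug in your own LVLM client here")

def predict_verdict(sample: Sample, task: str) -> str:
    """Ask the model to pick exactly one label from the task's label set."""
    labels = TASKS[task]
    prompt = (
        f"Claim: {sample.claim}\n"
        f"Task: {task.replace('_', ' ')} classification.\n"
        f"Answer with exactly one of: {', '.join(labels)}."
    )
    response = query_lvlm(sample.image_path, prompt).strip().lower()
    # Fall back to the last label (e.g., "not enough information")
    # if the response does not contain any expected label.
    return next((label for label in labels if label in response), labels[-1])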

Cite

Text

Wang et al. "MFC-Bench: Benchmarking Multimodal Fact-Checking with Large Vision-Language Models." ICLR 2025 Workshops: BuildingTrust, 2025.

Markdown

[Wang et al. "MFC-Bench: Benchmarking Multimodal Fact-Checking with Large Vision-Language Models." ICLR 2025 Workshops: BuildingTrust, 2025.](https://mlanthology.org/iclrw/2025/wang2025iclrw-mfcbench/)

BibTeX

@inproceedings{wang2025iclrw-mfcbench,
  title     = {{MFC-Bench: Benchmarking Multimodal Fact-Checking with Large Vision-Language Models}},
  author    = {Wang, Shengkang and Lin, Hongzhan and Luo, Ziyang and Ye, Zhen and Chen, Guang and Ma, Jing},
  booktitle = {ICLR 2025 Workshops: BuildingTrust},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/wang2025iclrw-mfcbench/}
}