THEMIS: Towards Holistic Evaluation of MLLMs for Scientific Paper Fraud Forensics

Ma, Tzu-Yen; Zhang, Bo; Tang, Zichen; Ding, Junpeng; Tian, Haolin; Li, Yuanze; Hao, Zhuodi; Ding, Zixin; Wang, Zirui; Yu, Xinyu; Peng, Shiyao; Zhao, Yizhuo; Jiang, Ruomeng; Huang, Yiling; Zhao, Peizhi; Chen, Jiayuan; Tan, Weisheng; Gao, Haocheng; Liu, Yang; Liu, Jiacheng; Yang, Zhongjun; Huang, Jiayu; E, Haihong

ICLR 2026

Abstract

We present **THEMIS**, a novel multi-task benchmark designed to comprehensively evaluate multimodal large language models (MLLMs) on visual fraud reasoning within real-world academic scenarios. Compared to existing benchmarks, THEMIS introduces three major advances. (1) **Real-World Scenarios and Complexity**: Our benchmark comprises over 4,000 questions spanning seven scenarios, derived from authentic retracted-paper cases and carefully curated multimodal synthetic data. With 60.47% complex-texture images, THEMIS bridges the critical gap between existing benchmarks and the complexity of real-world academic fraud. (2) **Fraud-Type Diversity and Granularity**: THEMIS systematically covers five challenging fraud types and introduces 16 fine-grained manipulation operations. On average, each sample undergoes multiple stacked manipulation operations, with the diversity and difficulty of these manipulations demanding a high level of visual fraud reasoning from the models. (3) **Multi-Dimensional Capability Evaluation**: We establish a mapping from fraud types to five core visual fraud reasoning capabilities, thereby enabling an evaluation that reveals the distinct strengths and specific weaknesses of different models across these core capabilities. Experiments on 16 leading MLLMs show that even the best-performing model, GPT-5, achieves an overall performance of only 56.15%, demonstrating that our benchmark presents a stringent test. We expect THEMIS to advance the development of MLLMs for complex, real-world fraud reasoning tasks.

Cite

Text

Ma et al. "THEMIS: Towards Holistic Evaluation of MLLMs for Scientific Paper Fraud Forensics." International Conference on Learning Representations, 2026.

Markdown

[Ma et al. "THEMIS: Towards Holistic Evaluation of MLLMs for Scientific Paper Fraud Forensics." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/ma2026iclr-themis/)

BibTeX

@inproceedings{ma2026iclr-themis,
  title     = {{THEMIS: Towards Holistic Evaluation of MLLMs for Scientific Paper Fraud Forensics}},
  author    = {Ma, Tzu-Yen and Zhang, Bo and Tang, Zichen and Ding, Junpeng and Tian, Haolin and Li, Yuanze and Hao, Zhuodi and Ding, Zixin and Wang, Zirui and Yu, Xinyu and Peng, Shiyao and Zhao, Yizhuo and Jiang, Ruomeng and Huang, Yiling and Zhao, Peizhi and Chen, Jiayuan and Tan, Weisheng and Gao, Haocheng and Liu, Yang and Liu, Jiacheng and Yang, Zhongjun and Huang, Jiayu and E, Haihong},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/ma2026iclr-themis/}
}