MedLesionVQA: A Multimodal Benchmark Emulating Clinical Visual Diagnosis for Body Surface Health

Yu, Deli; Wang, Shengzhi; Wu, Kai; Ji, Xiaozhong; Cui, Bo; Cao, Jieqiong; Wang, Huichao; Jiang, Boyuan; Wang, Xu; Xu, Qian; ChaoGao; Zhao, Yi; Chen, Dian; Li, Meng; Wu, Haifeng; He, Yijun; Yang, Haihua

ICLR 2026

Abstract

Body-surface health conditions, spanning diverse clinical departments, represent some of the most frequent diagnostic scenarios and a primary target for medical multimodal large language models (MLLMs). Yet existing medical benchmarks either are built from publicly available sources with limited expert curation or focus narrowly on disease classification, and so fail to reflect the stepwise recognition and reasoning that physicians follow in real practice. To address this gap, we introduce MedLesionVQA, the first large-scale benchmark explicitly designed to evaluate MLLMs on the visual diagnostic workflow for body-surface conditions. All questions are derived from authentic clinical visual diagnosis scenarios and verified by medical experts with over 20 years of experience, and the data are drawn from 10K+ real patient visits, ensuring authenticity, clinical realism, and diversity. MedLesionVQA consists of 12K in-house images (never publicly released) and 19K expert-verified question–answer pairs, with fine-grained annotations covering 94 lesion types, 110 body regions, and 96 diseases. We evaluate 20+ state-of-the-art MLLMs against human physicians: the best model reaches 56.2% accuracy, below primary physicians (61.4%) and far below senior specialists (73.2%). These results expose the persistent gap between MLLMs and clinical expertise, underscoring the need for multimodal benchmarks that drive trustworthy medical AI. The dataset is available at https://github.com/bytedance/MedLesionVQA.

Cite

Text

Yu et al. "MedLesionVQA: A Multimodal Benchmark Emulating Clinical Visual Diagnosis for Body Surface Health." International Conference on Learning Representations, 2026.

Markdown

[Yu et al. "MedLesionVQA: A Multimodal Benchmark Emulating Clinical Visual Diagnosis for Body Surface Health." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/yu2026iclr-medlesionvqa/)

BibTeX

@inproceedings{yu2026iclr-medlesionvqa,
  title     = {{MedLesionVQA: A Multimodal Benchmark Emulating Clinical Visual Diagnosis for Body Surface Health}},
  author    = {Yu, Deli and Wang, Shengzhi and Wu, Kai and Ji, Xiaozhong and Cui, Bo and Cao, Jieqiong and Wang, Huichao and Jiang, Boyuan and Wang, Xu and Xu, Qian and {ChaoGao} and Zhao, Yi and Chen, Dian and Li, Meng and Wu, Haifeng and He, Yijun and Yang, Haihua},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/yu2026iclr-medlesionvqa/}
}