Evaluation of Medical Large Language Models: Taxonomy, Review, and Directions

Lacerda, Anísio; Pappa, Gisele L.; Pereira, Adriano César Machado; Meira Jr., Wagner; de Almeida Barros, Alexandre Guimarães

doi:10.24963/IJCAI.2025/1169

Anísio Lacerda, Gisele L. Pappa, Adriano César Machado Pereira, Wagner Meira Jr., Alexandre Guimarães de Almeida Barros

IJCAI 2025 pp. 10528-10536

Abstract

The integration of Large Language Models (LLMs) into medicine presents both great opportunities and significant challenges, particularly in ensuring that these models are accurate, reliable, and safe. While LLMs have shown impressive capabilities in understanding and generating human language, their application in the medical domain requires careful evaluation because medical applications are inherently linked to patient life and health. Current evaluations of LLMs in medicine are often fragmented and insufficient: they lack standardized performance metrics, make limited use of real patient data, and pay too little attention to important applications such as documentation, education, and research. Furthermore, traditional NLP-based evaluations are often inadequate for assessing the text generated by LLMs. Therefore, robust evaluation is essential to ensure the responsible and effective use of LLMs in medical settings and to address the inherent challenges associated with their implementation. This paper explores the various dimensions of LLM evaluation in the medical domain, proposes a new taxonomy for categorizing medical applications, and discusses directions for future research in this critical area.

Cite

Text

Lacerda et al. "Evaluation of Medical Large Language Models: Taxonomy, Review, and Directions." International Joint Conference on Artificial Intelligence, 2025. doi:10.24963/IJCAI.2025/1169

Markdown

[Lacerda et al. "Evaluation of Medical Large Language Models: Taxonomy, Review, and Directions." International Joint Conference on Artificial Intelligence, 2025.](https://mlanthology.org/ijcai/2025/lacerda2025ijcai-evaluation/) doi:10.24963/IJCAI.2025/1169

BibTeX

@inproceedings{lacerda2025ijcai-evaluation,
  title     = {{Evaluation of Medical Large Language Models: Taxonomy, Review, and Directions}},
  author    = {Lacerda, Anísio and Pappa, Gisele L. and Pereira, Adriano César Machado and Meira, Jr., Wagner and de Almeida Barros, Alexandre Guimarães},
  booktitle = {International Joint Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {10528--10536},
  doi       = {10.24963/IJCAI.2025/1169},
  url       = {https://mlanthology.org/ijcai/2025/lacerda2025ijcai-evaluation/}
}