Medical Interpretability and Knowledge Maps of Large Language Models

Marinescu, Razvan; Gruber, Victoria-Elisabeth; V., Diego Fajardo

ICLR 2026

Abstract

We present a systematic study of medical-domain interpretability in Large Language Models (LLMs). We study how LLMs both represent and process medical knowledge using four interpretability techniques: (1) UMAP projections of intermediate activations, (2) gradient-based saliency with respect to the model weights, (3) layer lesioning/removal, and (4) activation patching. We present knowledge maps of five LLMs that show, at a coarse resolution, where knowledge about patients' ages, medical symptoms, diseases, and drugs is stored in the models. In particular, for Llama3.3-70B, we find that most medical knowledge is processed in the first half of the model's layers. In addition, we find several interesting phenomena: (i) age is often encoded in a non-linear and sometimes discontinuous manner at intermediate layers, (ii) the disease progression representation is non-monotonic and circular at certain layers, (iii) drugs cluster better by medical specialty than by mechanism of action, especially for Llama, and (iv) Gemma-27B and MedGemma-27B have activations that collapse at intermediate layers but recover by the final layers. These results can guide future research on fine-tuning, unlearning, or de-biasing LLMs for medical tasks by suggesting the layers at which these techniques should be applied. We attach our source code to the paper for reproducibility.
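
Two of the four techniques named in the abstract, layer lesioning and activation patching, can be illustrated on a toy model. The following is a minimal sketch and not the authors' code: it assumes a small, randomly initialized residual stack written in pure Python in place of an LLM, and every name in it (`make_layers`, `forward`, `lesion_scores`, `patching_scores`) is hypothetical. Patching here replaces one layer's output with the output cached from a clean run, which is only one of several common patching variants.

```python
import math
import random


def make_layers(n_layers, dim, seed=0):
    """Random weight matrices standing in for trained layers (assumption)."""
    rng = random.Random(seed)
    return [[[rng.gauss(0, 0.3) for _ in range(dim)] for _ in range(dim)]
            for _ in range(n_layers)]


def layer_update(W, h):
    """The update one layer adds to the residual stream: tanh(W h)."""
    return [math.tanh(sum(w * x for w, x in zip(row, h))) for row in W]


def forward(layers, x, skip=None, patch=None):
    """Run the residual stack and return (final state, per-layer updates).

    skip:  index of a layer to lesion (its update is set to zero).
    patch: (index, update) pair; that layer's update is overwritten.
    """
    h = list(x)
    updates = []
    for i, W in enumerate(layers):
        if i == skip:
            u = [0.0] * len(h)       # lesioned layer contributes nothing
        elif patch is not None and patch[0] == i:
            u = list(patch[1])       # reuse an update cached from another run
        else:
            u = layer_update(W, h)
        updates.append(u)
        h = [a + b for a, b in zip(h, u)]
    return h, updates


def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def lesion_scores(layers, x):
    """Output change caused by removing each layer in turn."""
    base, _ = forward(layers, x)
    return [distance(base, forward(layers, x, skip=i)[0])
            for i in range(len(layers))]


def patching_scores(layers, clean, corrupted):
    """Remaining distance to the clean output after patching each layer.

    A smaller value means that layer's clean output alone moves the
    corrupted run closer to the clean result.
    """
    clean_out, clean_updates = forward(layers, clean)
    return [distance(clean_out,
                     forward(layers, corrupted, patch=(i, clean_updates[i]))[0])
            for i in range(len(layers))]


if __name__ == "__main__":
    layers = make_layers(n_layers=6, dim=4, seed=0)
    clean = [1.0, 0.5, -0.5, 0.0]
    corrupted = [-1.0, 0.5, -0.5, 0.0]
    print("lesion:", [round(s, 3) for s in lesion_scores(layers, clean)])
    print("patch: ", [round(s, 3) for s in patching_scores(layers, clean, corrupted)])
```

On a real model the same per-layer scores would typically be computed with framework hooks on the transformer blocks and a task-specific metric instead of Euclidean distance; the sketch only shows the bookkeeping.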

Cite

Text

Marinescu et al. "Medical Interpretability and Knowledge Maps of Large Language Models." International Conference on Learning Representations, 2026.

Markdown

[Marinescu et al. "Medical Interpretability and Knowledge Maps of Large Language Models." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/marinescu2026iclr-medical/)

BibTeX

@inproceedings{marinescu2026iclr-medical,
  title     = {{Medical Interpretability and Knowledge Maps of Large Language Models}},
  author    = {Marinescu, Razvan and Gruber, Victoria-Elisabeth and V., Diego Fajardo},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/marinescu2026iclr-medical/}
}