Empirical Analysis of Scaling Vision Foundation Models for Chest X-Rays

Al Mahrooqi, Ahmed; Munjal, Prateek; Rajan, Ronnie; Pimentel, Marco AF; Kanithi, Praveenkumar

MIDL 2025

Abstract

Recent advancements in multimodal transformers have shown remarkable success in computer vision and natural language tasks, yet their adaptation to the clinical domain remains challenging. We introduce CXformer, a vision transformer adapted for chest X-ray analysis through a systematic investigation of architectural choices and training modifications starting from DINOv2. Our empirical results show that using registers in ViT training, centering the teacher model's softmax outputs, and optimizing the number of attention heads lead to better performance. The small version, CXformer(S) (22M parameters), achieves 83.28% mean AUROC on the CheXpert test set, surpassing the baseline of 80.46% achieved with vanilla DINOv2 settings. Contrary to common assumptions, our larger model, CXformer(B), with 87M parameters shows similar performance at 84% mean AUROC on CheXpert, suggesting that training optimizations matter more than model size. Furthermore, compared to the current state-of-the-art RAD-DINO, CXformer(B) uses 46% less pretraining compute (in FLOPs) and achieves an average AUROC of 87.93% (vs. 87.32% for RAD-DINO) on pathology image classification evaluated across three widely used CXR datasets, namely CheXpert, RSNA Pneumonia, and NIH CXR8. Beyond classification, CXformer also delivers competitive, and occasionally superior, performance in semantic segmentation and radiology report generation, underscoring its versatility. CXformer base and small models can be found at https://huggingface.co/m42-health
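The headline numbers in the abstract are mean AUROC values over several pathology labels. As a minimal sketch of how such a metric is commonly computed (the paper's exact evaluation protocol is not given in the abstract), the code below computes a per-label AUROC from the rank-sum (Mann-Whitney U) statistic and averages it across labels. The function names `auroc` and `mean_auroc` and the toy arrays are hypothetical, introduced only for illustration.

```python
import numpy as np


def auroc(y_true, y_score):
    """AUROC for one binary label via the Mann-Whitney U statistic.

    Tied scores receive their average rank, so a constant predictor scores 0.5.
    """
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    n_pos = int(y_true.sum())
    n_neg = y_true.size - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC is undefined unless both classes are present")

    # Assign 1-based average ranks to the scores, handling ties.
    order = np.argsort(y_score, kind="mergesort")
    sorted_scores = y_score[order]
    ranks = np.empty(y_score.size, dtype=float)
    i = 0
    while i < sorted_scores.size:
        j = i
        while j + 1 < sorted_scores.size and sorted_scores[j + 1] == sorted_scores[i]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0
        i = j + 1

    # U counts (positive, negative) pairs ranked correctly; ties count as half.
    u = ranks[y_true].sum() - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)


def mean_auroc(Y_true, Y_score):
    """Unweighted mean of per-label AUROCs; columns are labels, rows are images."""
    Y_true = np.asarray(Y_true)
    Y_score = np.asarray(Y_score)
    per_label = [auroc(Y_true[:, k], Y_score[:, k]) for k in range(Y_true.shape[1])]
    return float(np.mean(per_label))


# Toy example with two hypothetical labels and four images (not data from the paper).
Y_true = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
Y_score = np.array([[0.10, 0.2], [0.40, 0.9], [0.35, 0.1], [0.80, 0.7]])
print(mean_auroc(Y_true, Y_score))
```

An unweighted mean treats every pathology equally regardless of prevalence; other averaging schemes exist, and which one the authors use is not stated here.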

Cite

Text

Al Mahrooqi et al. "Empirical Analysis of Scaling Vision Foundation Models for Chest X-Rays." Medical Imaging with Deep Learning, 2025.

Markdown

[Al Mahrooqi et al. "Empirical Analysis of Scaling Vision Foundation Models for Chest X-Rays." Medical Imaging with Deep Learning, 2025.](https://mlanthology.org/midl/2025/mahrooqi2025midl-empirical/)

BibTeX

@inproceedings{mahrooqi2025midl-empirical,
  title     = {{Empirical Analysis of Scaling Vision Foundation Models for Chest X-Rays}},
  author    = {Al Mahrooqi, Ahmed and Munjal, Prateek and Rajan, Ronnie and Pimentel, Marco AF and Kanithi, Praveenkumar},
  booktitle = {Medical Imaging with Deep Learning},
  year      = {2025},
  url       = {https://mlanthology.org/midl/2025/mahrooqi2025midl-empirical/}
}