Adaptive Inference for Medical Vision Transformers: Token Reduction or Early Exit?
Abstract
Vision Transformers (ViTs) have demonstrated exceptional performance in medical image analysis, yet their computational demands hinder clinical deployment, particularly in time-sensitive applications. Medical imaging requires sample-adaptive optimization due to dataset heterogeneity across modalities and sample complexity; uniform strategies balance efficiency and accuracy poorly. We propose a unified adaptive inference framework that combines Token Reduction (TR) and Early Exiting (EE) through dataset-specific profiling. Our approach quantifies spatial redundancy via Jensen-Shannon Divergence (JSD) and prediction confidence at intermediate layers to train a lightweight predictor that dynamically selects inference strategies at test time. Across five medical datasets, including a real-world cataract dataset (INSIGHT), our framework achieves 71.4% average floating-point operations (FLOPs) reduction with only 0.1pp accuracy loss, substantially outperforming individual strategies (EE-only: 55.9%, TR-only: 57.7%). On PathMNIST, our adaptive inference framework simultaneously improves accuracy by 1.3pp while reducing computation by 77.2%. On INSIGHT, we maintain baseline accuracy with 69.8% FLOPs reduction, demonstrating robust real-world clinical applicability.
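The two profiling signals named in the abstract, Jensen-Shannon Divergence and prediction confidence, can be illustrated with a minimal sketch. The abstract does not say which distributions the JSD is computed over or how confidence is defined, so the code below makes assumptions: `jsd` compares two arbitrary discrete probability distributions (for example, normalized per-token distributions), and `confidence` is taken to be the maximum softmax probability of an intermediate classifier's logits. All function names are hypothetical and this is not the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions.

    With base-2 logarithms the value lies in [0, 1]: 0 for identical
    distributions, 1 for distributions with disjoint support.
    """
    mix = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mix) + 0.5 * kl_divergence(q, mix)


def confidence(logits):
    """Assumed confidence measure: the largest softmax probability."""
    return max(softmax(logits))
```

Under these assumptions, a low JSD between token distributions would suggest spatial redundancy, and a high confidence at an intermediate layer would suggest a sample that can exit early; in the paper, such signals feed a trained predictor rather than fixed thresholds.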
Cite
Text
Byun et al. "Adaptive Inference for Medical Vision Transformers: Token Reduction or Early Exit?." Proceedings of The 9th International Conference on Medical Imaging with Deep Learning, 2026.
Markdown
[Byun et al. "Adaptive Inference for Medical Vision Transformers: Token Reduction or Early Exit?." Proceedings of The 9th International Conference on Medical Imaging with Deep Learning, 2026.](https://mlanthology.org/midl/2026/byun2026midl-adaptive/)
BibTeX
@inproceedings{byun2026midl-adaptive,
title = {{Adaptive Inference for Medical Vision Transformers: Token Reduction or Early Exit?}},
author = {Byun, Ji Young and Lee, HyunSeo and Shuff, Jordan and Venkatesh, Rengaraj and Shekhawat, Nakul S. and Parikh, Kunal S. and Chellappa, Rama},
booktitle = {Proceedings of The 9th International Conference on Medical Imaging with Deep Learning},
year = {2026},
pages = {2171--2191},
volume = {315},
url = {https://mlanthology.org/midl/2026/byun2026midl-adaptive/}
}