Federated EndoViT: Pretraining Vision Transformers via Federated Learning on Endoscopic Image Collections
Abstract
Purpose: Data privacy regulations hinder the creation of generalizable foundation models (FMs) for surgery by preventing multi-institutional data aggregation. This study investigates federated learning (FL) as a privacy-preserving solution to collaboratively train robust surgical FMs. Methods: We introduce Federated EndoViT (FL-EndoViT), a federated framework that validates the Masked Autoencoder (MAE) pretraining strategy in a decentralized surgical setting. To ensure convergence under severe data heterogeneity, the architecture integrates adaptive Sharpness-Aware Minimization (FedSAM). Pretrained on the large-scale Endo700k dataset, FL-EndoViT is evaluated against a centralized baseline on different tasks including scene segmentation, action recognition, and phase recognition. Results: FedSAM is critical for successful pretraining, overcoming the convergence failures of standard federated methods. The resulting FL-EndoViT performs comparably to its centralized counterpart, with significant advantages in data-scarce, high-resolution segmentation and generalization to new surgical events. We also establish that full, end-to-end fine-tuning is necessary for optimal performance. Conclusion: This work validates FL with adaptive optimization as a viable paradigm for creating robust, privacy-preserving surgical FMs. Our findings provide a scalable framework for collaborative Surgical Data Science and underscore the optimizer’s critical role in handling data heterogeneity. Future work should explore video-based models to incorporate spatiotemporal dynamics.
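As a rough illustration of the optimization idea named in the abstract, the sketch below combines a Sharpness-Aware Minimization (SAM) step on each client with sample-weighted federated averaging on the server. It is a minimal toy under stated assumptions, not the authors' implementation: it uses plain (non-adaptive) SAM, NumPy vectors instead of a Vision Transformer, and quadratic client objectives standing in for the MAE reconstruction loss. All function names, client centers, sizes, and hyperparameters (`lr`, `rho`, `steps`) are hypothetical.

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One SAM update: perturb the weights along the normalized gradient,
    then descend from the original weights using the gradient measured there."""
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # ascent direction of radius rho
    g_sharp = grad_fn(w + eps)                   # gradient at the perturbed point
    return w - lr * g_sharp

def local_train(w_global, grad_fn, steps=5):
    """Run a few SAM steps on one client, starting from the global weights."""
    w = w_global.copy()
    for _ in range(steps):
        w = sam_step(w, grad_fn)
    return w

def fed_avg(client_weights, client_sizes):
    """Server aggregation: average client weights, weighted by sample count."""
    sizes = np.asarray(client_sizes, dtype=float)
    stacked = np.stack(client_weights)
    return (stacked * (sizes / sizes.sum())[:, None]).sum(axis=0)

def make_quadratic_grad(center):
    """Toy client objective 0.5 * ||w - center||^2, whose gradient is w - center."""
    return lambda w: w - center

# Three hypothetical clients with heterogeneous optima and unequal data sizes.
centers = [np.array([1.0, 0.0]), np.array([0.0, 3.0]), np.array([-2.0, 1.0])]
sizes = [100, 300, 600]

w = np.zeros(2)
for _ in range(50):  # communication rounds
    updates = [local_train(w, make_quadratic_grad(c)) for c in centers]
    w = fed_avg(updates, sizes)
```

In this toy setting the global weights settle close to the size-weighted mean of the client optima. Only model weights leave each client, which is the privacy argument the abstract makes for FL.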
Cite
Text
Kirchner et al. "Federated EndoViT: Pretraining Vision Transformers via Federated Learning on Endoscopic Image Collections." Proceedings of The 9th International Conference on Medical Imaging with Deep Learning, 2026.
Markdown
[Kirchner et al. "Federated EndoViT: Pretraining Vision Transformers via Federated Learning on Endoscopic Image Collections." Proceedings of The 9th International Conference on Medical Imaging with Deep Learning, 2026.](https://mlanthology.org/midl/2026/kirchner2026midl-federated/)
BibTeX
@inproceedings{kirchner2026midl-federated,
title = {{Federated EndoViT: Pretraining Vision Transformers via Federated Learning on Endoscopic Image Collections}},
author = {Kirchner, Max and Jenke, Alexander C. and Bodenstedt, Sebastian and Kolbinger, Fiona R. and Saldanha, Oliver L. and Kather, Jakob N. and Wagner, Martin and Speidel, Stefanie},
booktitle = {Proceedings of The 9th International Conference on Medical Imaging with Deep Learning},
year = {2026},
pages = {1903-1934},
volume = {315},
url = {https://mlanthology.org/midl/2026/kirchner2026midl-federated/}
}