Federated EndoViT: Pretraining Vision Transformers via Federated Learning on Endoscopic Image Collections

Kirchner, Max; Jenke, Alexander C.; Bodenstedt, Sebastian; Kolbinger, Fiona R.; Saldanha, Oliver L.; Kather, Jakob N.; Wagner, Martin; Speidel, Stefanie

MIDL 2026, pp. 1903-1934

Abstract

Purpose: Data privacy regulations hinder the creation of generalizable foundation models (FMs) for surgery by preventing multi-institutional data aggregation. This study investigates federated learning (FL) as a privacy-preserving solution to collaboratively train robust surgical FMs.

Methods: We introduce Federated EndoViT (FL-EndoViT), a federated framework that validates the Masked Autoencoder (MAE) pretraining strategy in a decentralized surgical setting. To ensure convergence under severe data heterogeneity, the architecture integrates adaptive Sharpness-Aware Minimization (FedSAM). Pretrained on the large-scale Endo700k dataset, FL-EndoViT is evaluated against a centralized baseline on different tasks including scene segmentation, action recognition, and phase recognition.

Results: FedSAM is critical for successful pretraining, overcoming the convergence failures of standard federated methods. The resulting FL-EndoViT performs comparably to its centralized counterpart, with significant advantages in data-scarce, high-resolution segmentation and generalization to new surgical events. We also establish that full, end-to-end fine-tuning is necessary for optimal performance.

Conclusion: This work validates FL with adaptive optimization as a viable paradigm for creating robust, privacy-preserving surgical FMs. Our findings provide a scalable framework for collaborative Surgical Data Science and underscore the optimizer’s critical role in handling data heterogeneity. Future work should explore video-based models to incorporate spatiotemporal dynamics.

Cite

Text

Kirchner et al. "Federated EndoViT: Pretraining Vision Transformers via Federated Learning on Endoscopic Image Collections." Proceedings of The 9th International Conference on Medical Imaging with Deep Learning, 2026.

Markdown

[Kirchner et al. "Federated EndoViT: Pretraining Vision Transformers via Federated Learning on Endoscopic Image Collections." Proceedings of The 9th International Conference on Medical Imaging with Deep Learning, 2026.](https://mlanthology.org/midl/2026/kirchner2026midl-federated/)

BibTeX

@inproceedings{kirchner2026midl-federated,
  title     = {{Federated EndoViT: Pretraining Vision Transformers via Federated Learning on Endoscopic Image Collections}},
  author    = {Kirchner, Max and Jenke, Alexander C. and Bodenstedt, Sebastian and Kolbinger, Fiona R. and Saldanha, Oliver L. and Kather, Jakob N. and Wagner, Martin and Speidel, Stefanie},
  booktitle = {Proceedings of The 9th International Conference on Medical Imaging with Deep Learning},
  year      = {2026},
  pages     = {1903--1934},
  volume    = {315},
  url       = {https://mlanthology.org/midl/2026/kirchner2026midl-federated/}
}