A Vision Foundation Model for Cataract Surgery Using Joint-Embedding Predictive Architecture
Abstract
Vision foundation models can automate analysis of surgical videos and enable multiple applications that support patient care and surgical training. For cataract surgery, existing models are limited by reliance on small datasets, privacy concerns, and poor generalizability across surgical settings. In this paper, we introduce JHU-VPT(JEPA), a self-supervised vision foundation model leveraging Joint-Embedding Predictive Architecture (JEPA) to learn spatiotemporal representations via latent feature prediction on a large corpus of unlabeled cataract videos, without requiring extensive labeled datasets or pixel-level reconstruction. JHU-VPT(JEPA) is pretrained on 2,591 videos from multiple sites that capture variations in surgical technique and style. Comprehensive evaluations on step recognition, surgical feedback, and skill assessment tasks demonstrate that JHU-VPT(JEPA) outperforms existing methods. JHU-VPT(JEPA)’s effectiveness is evident even when using attentive probing with a frozen encoder, highlighting the robustness of the learned features and addressing privacy concerns by not requiring access to raw videos during downstream tasks. Our approach offers a scalable, generalizable, and privacy-preserving solution for surgical video analysis, with significant potential to advance patient care and surgical education.
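The abstract mentions attentive probing with a frozen encoder. As a rough illustration only, the sketch below shows one common form of such a probe: a single learnable query attends over token features that a frozen encoder has already produced, and a linear layer classifies the pooled vector. The abstract does not specify the probe design, so the function name `attentive_probe`, the single-query pooling, the shapes, and the random stand-in features are all assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attentive_probe(tokens, query, w_k, w_v, w_out):
    """Pool frozen token features with one learnable query, then classify.

    tokens:   (n_tokens, dim) features from the frozen encoder (not updated)
    query:    (dim,) learnable query vector (hypothetical probe parameter)
    w_k, w_v: (dim, dim) learnable key/value projections
    w_out:    (dim, n_classes) learnable linear classifier
    """
    keys = tokens @ w_k                                # (n_tokens, dim)
    values = tokens @ w_v                              # (n_tokens, dim)
    scores = keys @ query / np.sqrt(tokens.shape[1])   # (n_tokens,)
    weights = softmax(scores)                          # attention over tokens
    pooled = weights @ values                          # (dim,)
    logits = pooled @ w_out                            # (n_classes,)
    return logits, weights

# Random stand-ins: in practice `tokens` would be precomputed encoder features.
rng = np.random.default_rng(0)
n_tokens, dim, n_classes = 16, 8, 4
tokens = rng.normal(size=(n_tokens, dim))
query = rng.normal(size=dim)
w_k = rng.normal(size=(dim, dim))
w_v = rng.normal(size=(dim, dim))
w_out = rng.normal(size=(dim, n_classes))

logits, weights = attentive_probe(tokens, query, w_k, w_v, w_out)
```

Only the probe parameters (`query`, `w_k`, `w_v`, `w_out`) would be trained in this setup. Because the probe consumes features and not frames, a downstream site would only need the extracted features, which is the privacy point the abstract makes.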
Cite
Text
Shah et al. "A Vision Foundation Model for Cataract Surgery Using Joint-Embedding Predictive Architecture." Medical Imaging with Deep Learning, 2025.
Markdown
[Shah et al. "A Vision Foundation Model for Cataract Surgery Using Joint-Embedding Predictive Architecture." Medical Imaging with Deep Learning, 2025.](https://mlanthology.org/midl/2025/shah2025midl-vision/)
BibTeX
@inproceedings{shah2025midl-vision,
title = {{A Vision Foundation Model for Cataract Surgery Using Joint-Embedding Predictive Architecture}},
author = {Shah, Nisarg A and Xia, Mingze and Vijay, Subhasri and Sikder, Shameema and Vedula, S. Swaroop and Patel, Vishal M.},
booktitle = {Medical Imaging with Deep Learning},
year = {2025},
url = {https://mlanthology.org/midl/2025/shah2025midl-vision/}
}