VidLPRO: A Video-Language Pre-Training Framework for Robotic and Laparoscopic Surgery
Abstract
We introduce VidLPRO, a novel video-language (VL) pre-training framework designed specifically for robotic and laparoscopic surgery. While existing surgical VL models primarily rely on contrastive learning, we propose a more comprehensive approach that captures the intricate temporal dynamics of surgical video and aligns video with language. VidLPRO integrates video-text contrastive learning, video-text matching, and masked language modeling objectives to learn rich VL representations. To support this framework, we present GenSurg+, a carefully curated dataset derived from GenSurgery, comprising 17k surgical video clips paired with captions generated by GPT-4 from transcripts extracted by the Whisper model. This dataset addresses the need for large-scale, high-quality VL data in the surgical domain. Extensive experiments on benchmark datasets, including Cholec80 and AutoLaparo, demonstrate the efficacy of our approach. VidLPRO achieves state-of-the-art performance in zero-shot surgical phase recognition, significantly outperforming existing surgical VL models such as SurgVLP and HecVL. Our model demonstrates improvements of up to 21.5% in accuracy and 15.7% in F1 score, setting a new benchmark in the field. Notably, VidLPRO exhibits robust performance even with single-frame inference, while effectively scaling with increased temporal context. Ablation studies reveal the impact of frame sampling strategies on model performance and computational efficiency. These results underscore VidLPRO's potential as a foundation model for surgical video understanding.
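The abstract names three pre-training objectives but not how they are combined. As a point of reference, the sketch below shows one conventional PyTorch formulation of such a joint loss; the function name, equal loss weighting, temperature, tensor shapes, and the -100 ignore-index convention for unmasked tokens are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def vl_pretraining_loss(video_emb, text_emb, match_logits, match_labels,
                        mlm_logits, mlm_labels, temperature=0.07):
    """Hypothetical joint loss combining the three objectives named in the
    abstract: video-text contrastive (VTC), video-text matching (VTM), and
    masked language modeling (MLM). Weights and temperature are assumptions."""
    # VTC: symmetric InfoNCE over in-batch negatives on normalized embeddings.
    video_emb = F.normalize(video_emb, dim=-1)            # (B, D)
    text_emb = F.normalize(text_emb, dim=-1)              # (B, D)
    sim = video_emb @ text_emb.t() / temperature          # (B, B) similarities
    targets = torch.arange(sim.size(0), device=sim.device)
    loss_vtc = (F.cross_entropy(sim, targets) +
                F.cross_entropy(sim.t(), targets)) / 2

    # VTM: binary classification of (video, text) pairs as matched/unmatched.
    loss_vtm = F.cross_entropy(match_logits, match_labels)  # logits (B, 2)

    # MLM: predict masked caption tokens; unmasked positions carry label -100.
    loss_mlm = F.cross_entropy(mlm_logits.view(-1, mlm_logits.size(-1)),
                               mlm_labels.view(-1), ignore_index=-100)

    # Equal weighting is a simplifying assumption.
    return loss_vtc + loss_vtm + loss_mlm
```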
Cite
Text
Honarmand et al. "VidLPRO: A Video-Language Pre-Training Framework for Robotic and Laparoscopic Surgery." NeurIPS 2024 Workshops: AIM-FM, 2024.
Markdown
[Honarmand et al. "VidLPRO: A Video-Language Pre-Training Framework for Robotic and Laparoscopic Surgery." NeurIPS 2024 Workshops: AIM-FM, 2024.](https://mlanthology.org/neuripsw/2024/honarmand2024neuripsw-vidlpro/)
BibTeX
@inproceedings{honarmand2024neuripsw-vidlpro,
title = {{VidLPRO: A Video-Language Pre-Training Framework for Robotic and Laparoscopic Surgery}},
author = {Honarmand, Mohammadmahdi and Jamal, Muhammad Abdullah and Mohareri, Omid},
booktitle = {NeurIPS 2024 Workshops: AIM-FM},
year = {2024},
url = {https://mlanthology.org/neuripsw/2024/honarmand2024neuripsw-vidlpro/}
}