MID-POSE: Multi-Instrument Detection and Pose Estimation in Endoscopic Surgery

Abstract

Reliable perception of surgical instruments is a key prerequisite for intraoperative guidance, context-aware assistance, and workflow analysis in minimally invasive surgery (MIS). This is particularly challenging in skull base procedures, where narrow anatomical corridors, frequent occlusions, specular highlights, and visually similar instruments make multi-class detection and 2D pose estimation difficult. We address joint instrument detection and keypoint-based pose estimation from monocular endoscopic videos and introduce MID-POSE, a dual-head architecture that couples a high-resolution HRNetV2p encoder with a class-agnostic dense detection-pose head and a Multi-level Instrument Classification (MIC) head that operates on RoI-aligned multi-level features. To support this task, we construct the PitSurg dataset from 26 clinical procedures, providing seven instrument classes with bounding boxes and detailed 2D keypoints. Using YOLOv8x-pose as our strongest baseline, which outperforms YOLO11x-pose on our tasks, MID-POSE improves Det/Pose $AP_{50\text{–}95}$ on PitSurg from $59.4/63.1$ to $77.5/78.5$ and on the robotic SurgPose dataset from $47.9/61.1$ to $62.7/71.4$. Qualitative analysis shows that high-resolution features sharpen localisation and keypoint placement, while the RoI classifier reduces misclassifications and spurious background detections. These results indicate that the proposed architecture and dataset provide an effective basis for robust multi-instrument perception in MIS.
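
The absolute gains implied by the reported Det/Pose figures can be tallied with a short script. This is only an arithmetic sketch over the numbers quoted in the abstract; the dictionary layout and the names `REPORTED`, `absolute_gains`, and `GAINS` are illustrative assumptions and do not come from the paper or its code.

```python
# Det/Pose AP (50-95) values copied from the abstract.
# The data layout and all names here are illustrative, not the authors'.
REPORTED = {
    "PitSurg": {"baseline": (59.4, 63.1), "mid_pose": (77.5, 78.5)},
    "SurgPose": {"baseline": (47.9, 61.1), "mid_pose": (62.7, 71.4)},
}


def absolute_gains(scores):
    """Return (detection gain, pose gain) in AP points, rounded to one decimal."""
    base_det, base_pose = scores["baseline"]
    new_det, new_pose = scores["mid_pose"]
    return round(new_det - base_det, 1), round(new_pose - base_pose, 1)


# Gains of MID-POSE over the YOLOv8x-pose baseline, per dataset.
GAINS = {name: absolute_gains(scores) for name, scores in REPORTED.items()}

if __name__ == "__main__":
    for name, (det, pose) in GAINS.items():
        print(f"{name}: Det +{det}, Pose +{pose}")
```

Read this way, the improvement over the baseline is 18.1/15.4 AP points (Det/Pose) on PitSurg and 14.8/10.3 AP points on SurgPose.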

Cite

Text

Wei et al. "MID-POSE: Multi-Instrument Detection and Pose Estimation in Endoscopic Surgery." Proceedings of The 9th International Conference on Medical Imaging with Deep Learning, 2026.

Markdown

[Wei et al. "MID-POSE: Multi-Instrument Detection and Pose Estimation in Endoscopic Surgery." Proceedings of The 9th International Conference on Medical Imaging with Deep Learning, 2026.](https://mlanthology.org/midl/2026/wei2026midl-midpose/)

BibTeX

@inproceedings{wei2026midl-midpose,
  title     = {{MID-POSE: Multi-Instrument Detection and Pose Estimation in Endoscopic Surgery}},
  author    = {Wei, Wenhua and Mennillo, Laurent and Mao, Zhehua and Wijekoon, Anjana and Feeny, Kendall and Khan, Danyal Zaman and Mazomenos, Evangelos B. and Stoyanov, Danail and Marcus, Hani J. and Bano, Sophia},
  booktitle = {Proceedings of The 9th International Conference on Medical Imaging with Deep Learning},
  year      = {2026},
  pages     = {1095--1114},
  volume    = {315},
  url       = {https://mlanthology.org/midl/2026/wei2026midl-midpose/}
}