Beyond Natural Images: A Dual-Stream DINOv3 Framework for PET/CT Segmentation

Lin, Yu-Nong Scarlett; Wang, Shansong; Safari, Mojtaba; Yang, Xiaofeng

MIDL 2026 pp. 2780-2794

Abstract

Self-supervised vision transformers such as DINOv3 are strong universal feature extractors, yet their transferability to functional medical imaging remains limited when they are pretrained on misaligned natural-image domains. In this work, we introduce Dual-DINOv3, a dual-stream framework for PET/CT that addresses two key gaps in existing work: the absence of a public, PET-specific pretrained encoder and the reliance on fully paired PET/CT data for multimodal pretraining. First, we present the first PET-specific DINOv3 encoder, pretrained exclusively on large-scale public FDG-PET datasets using the full three-stage DINOv3 self-distillation pipeline. Second, we propose a modality-separated PET/CT framework in which PET- and CT-specific encoders are pretrained independently and fused during finetuning via multiscale cross-attention, enabling multimodal representation learning without requiring paired data during pretraining. Evaluation on the HECKTOR tumor segmentation benchmark yields three central findings: (1) misaligned natural-image pretraining degrades PET/CT performance relative to training from scratch, (2) domain-aligned CT pretraining substantially improves segmentation across all tumor sizes, and (3) dual-stream PET/CT pretraining achieves the best overall performance, highlighting the complementary contributions of functional and anatomical cues. Together, these contributions provide a fully public PET encoder and a scalable PET/CT foundation model that support domain-aligned representation learning under realistic clinical data constraints.
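To make the fusion step in the abstract concrete, the following is a minimal, dependency-free sketch of cross-attention applied independently at several feature scales. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say which stream supplies the queries, so PET tokens querying CT tokens is assumed here, and the learned projections, multiple heads, and normalization of a real model are omitted. The names `cross_attention` and `fuse_multiscale` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Single-head cross-attention with a residual connection.

    Assumption: no learned query/key/value projections, so the key and
    value of each token are the token itself.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for q in queries:
        # Scaled dot-product score of this query against every key token.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys_values]
        weights = softmax(scores)
        # Attention-weighted average of the value tokens.
        attended = [
            sum(w * v[d] for w, v in zip(weights, keys_values))
            for d in range(dim)
        ]
        # Residual: keep the query stream and add the attended context.
        fused.append([a + b for a, b in zip(q, attended)])
    return fused


def fuse_multiscale(pet_scales, ct_scales):
    """Apply cross-attention separately at each scale (assumed PET -> CT)."""
    return [cross_attention(p, c) for p, c in zip(pet_scales, ct_scales)]


# Toy features: two scales, each a list of tokens with 2 channels.
pet_scales = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5]]]
ct_scales = [[[2.0, 2.0], [2.0, 2.0], [2.0, 2.0]], [[1.0, -1.0]]]
fused = fuse_multiscale(pet_scales, ct_scales)
```

Each fused scale keeps the token count of the PET stream, while the CT stream may have a different number of tokens at the same scale, because attention only requires matching channel dimensions.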

Cite

Text

Lin et al. "Beyond Natural Images: A Dual-Stream DINOv3 Framework for PET/CT Segmentation." Proceedings of The 9th International Conference on Medical Imaging with Deep Learning, 2026.

Markdown

[Lin et al. "Beyond Natural Images: A Dual-Stream DINOv3 Framework for PET/CT Segmentation." Proceedings of The 9th International Conference on Medical Imaging with Deep Learning, 2026.](https://mlanthology.org/midl/2026/lin2026midl-beyond/)

BibTeX

@inproceedings{lin2026midl-beyond,
  title     = {{Beyond Natural Images: A Dual-Stream DINOv3 Framework for PET/CT Segmentation}},
  author    = {Lin, Yu-Nong Scarlett and Wang, Shansong and Safari, Mojtaba and Yang, Xiaofeng},
  booktitle = {Proceedings of The 9th International Conference on Medical Imaging with Deep Learning},
  year      = {2026},
  pages     = {2780--2794},
  volume    = {315},
  url       = {https://mlanthology.org/midl/2026/lin2026midl-beyond/}
}