Studying Image Diffusion Features for Zero-Shot Video Object Segmentation

Abstract

This paper investigates the use of large-scale diffusion models for Zero-Shot Video Object Segmentation (ZS-VOS) without fine-tuning on video data or training on any image segmentation data. While diffusion models have demonstrated strong visual representations across various tasks, their direct application to ZS-VOS remains underexplored. Our goal is to find the optimal feature extraction process for ZS-VOS by identifying the most suitable time step and layer from which to extract features. We further analyze the affinity of these features and observe a strong correlation with point correspondences. Through extensive experiments on DAVIS-17 and MOSE, we find that diffusion models trained on ImageNet outperform those trained on larger, more diverse datasets for ZS-VOS. Additionally, we highlight the importance of point correspondences in achieving high segmentation accuracy, and our method achieves state-of-the-art results in ZS-VOS. Finally, our approach performs on par with models trained on expensive image segmentation datasets.
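The role of feature affinity in this setting can be illustrated with a minimal sketch: given per-pixel features extracted from two frames, a mask is propagated by matching each target-frame location to its most similar reference-frame locations and voting over their labels. This is a hypothetical toy implementation, not the paper's pipeline; the function name, the top-k voting scheme, and the toy data are illustrative assumptions.

```python
import numpy as np

def propagate_mask(feat_ref, feat_tgt, mask_ref, top_k=3):
    """Propagate a segmentation mask from a reference frame to a
    target frame via the affinity of per-pixel features.

    feat_ref, feat_tgt: (N, C) per-pixel feature maps (N = H * W),
        e.g. features extracted from a diffusion model's U-Net.
    mask_ref: (N,) integer labels for the reference frame.
    Returns an (N,) array of predicted labels for the target frame.
    """
    # Cosine affinity between every target and reference location.
    f_ref = feat_ref / np.linalg.norm(feat_ref, axis=1, keepdims=True)
    f_tgt = feat_tgt / np.linalg.norm(feat_tgt, axis=1, keepdims=True)
    affinity = f_tgt @ f_ref.T                      # (N_tgt, N_ref)

    # Keep only the top-k reference correspondences per target pixel.
    idx = np.argsort(-affinity, axis=1)[:, :top_k]  # (N_tgt, k)

    # Vote: each target pixel takes the majority label of its matches.
    labels = mask_ref[idx]                          # (N_tgt, k)
    n_classes = int(mask_ref.max()) + 1
    votes = np.apply_along_axis(np.bincount, 1, labels, None, n_classes)
    return votes.argmax(axis=1)
```

With orthogonal toy features whose correspondences are a pure permutation, the propagated mask is just the permuted reference mask, which makes the affinity-to-correspondence link concrete.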

Cite

Text

Delatolas et al. "Studying Image Diffusion Features for Zero-Shot Video Object Segmentation." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2025.

Markdown

[Delatolas et al. "Studying Image Diffusion Features for Zero-Shot Video Object Segmentation." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2025.](https://mlanthology.org/cvprw/2025/delatolas2025cvprw-studying/)

BibTeX

@inproceedings{delatolas2025cvprw-studying,
  title     = {{Studying Image Diffusion Features for Zero-Shot Video Object Segmentation}},
  author    = {Delatolas, Thanos and Kalogeiton, Vicky and Papadopoulos, Dim P.},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
  year      = {2025},
  pages     = {2636--2647},
  url       = {https://mlanthology.org/cvprw/2025/delatolas2025cvprw-studying/}
}