UVIS: Unsupervised Video Instance Segmentation
Abstract
Video instance segmentation requires classifying, segmenting, and tracking every object across video frames. Unlike existing approaches that rely on masks, boxes, or category labels, we propose UVIS, a novel Unsupervised Video Instance Segmentation framework that performs video instance segmentation without any video annotations or dense label-based pretraining. Our key insight is to leverage the dense shape prior of the self-supervised vision foundation model DINO and the open-set recognition ability of the image-caption-supervised vision-language model CLIP. Our UVIS framework consists of three essential steps: frame-level pseudo-label generation, transformer-based VIS model training, and query-based tracking. To improve the quality of VIS predictions in the unsupervised setup, we introduce a dual-memory design. This design includes a semantic memory bank for generating accurate pseudo-labels and a tracking memory bank for maintaining temporal consistency in object tracks. We evaluate our approach on three standard VIS benchmarks, namely YouTubeVIS-2019, YouTubeVIS-2021, and Occluded VIS. Our UVIS achieves 21.1 AP on YouTubeVIS-2019 without any video annotations or dense pretraining, demonstrating the potential of our unsupervised VIS framework.
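The query-based tracking step with a tracking memory bank can be illustrated with a minimal sketch: per-frame instance embeddings (e.g., transformer decoder queries) are matched to stored track embeddings by cosine similarity, and matched tracks are updated with an exponential moving average. This is an assumption-laden sketch, not the paper's implementation; the class name `TrackingMemoryBank`, the similarity threshold, and the EMA momentum are illustrative choices.

```python
import numpy as np


def cosine_sim(a, b):
    # Pairwise cosine similarity between rows of a (N, D) and b (M, D).
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-8)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-8)
    return a @ b.T


class TrackingMemoryBank:
    """Hypothetical tracking memory bank: one embedding per active track."""

    def __init__(self, sim_thresh=0.7, momentum=0.9):
        self.embeddings = []          # list of (D,) track embeddings
        self.sim_thresh = sim_thresh  # assumed matching threshold
        self.momentum = momentum      # assumed EMA momentum

    def assign(self, queries):
        """Greedily match per-frame queries (N, D) to tracks; returns track ids."""
        ids = []
        for q in queries:
            if self.embeddings:
                sims = cosine_sim(q[None], np.stack(self.embeddings))[0]
                j = int(np.argmax(sims))
                if sims[j] >= self.sim_thresh:
                    # Update the matched track with an exponential moving average.
                    self.embeddings[j] = (self.momentum * self.embeddings[j]
                                          + (1 - self.momentum) * q)
                    ids.append(j)
                    continue
            # No sufficiently similar track: start a new one.
            self.embeddings.append(q.copy())
            ids.append(len(self.embeddings) - 1)
        return ids


# Usage: two consecutive frames, each with three instance queries.
rng = np.random.default_rng(0)
bank = TrackingMemoryBank()
frame1 = rng.normal(size=(3, 256))
frame2 = frame1 + 0.05 * rng.normal(size=(3, 256))  # slightly perturbed queries
print(bank.assign(frame1))  # [0, 1, 2] — three new tracks
print(bank.assign(frame2))  # [0, 1, 2] — re-associated with the same tracks
```

The greedy per-query matching here is only for clarity; a bipartite assignment would avoid two queries claiming the same track in the same frame.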
Cite
Text
Huang et al. "UVIS: Unsupervised Video Instance Segmentation." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024. doi:10.1109/CVPRW63382.2024.00274
Markdown
[Huang et al. "UVIS: Unsupervised Video Instance Segmentation." IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2024.](https://mlanthology.org/cvprw/2024/huang2024cvprw-uvis/) doi:10.1109/CVPRW63382.2024.00274
BibTeX
@inproceedings{huang2024cvprw-uvis,
title = {{UVIS: Unsupervised Video Instance Segmentation}},
author = {Huang, Shuaiyi and Suri, Saksham and Gupta, Kamal and Rambhatla, Sai Saketh and Lim, Ser-Nam and Shrivastava, Abhinav},
booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops},
year = {2024},
  pages = {2682--2692},
doi = {10.1109/CVPRW63382.2024.00274},
url = {https://mlanthology.org/cvprw/2024/huang2024cvprw-uvis/}
}