Minimizing Labeled, Maximizing Unlabeled: An Image-Driven Approach for Video Instance Segmentation
Abstract
Traditional video instance segmentation (VIS) models rely on extensive per-frame video annotations, which are both time-consuming and costly. In this paper, we present MinMaxVIS, a novel VIS framework that reduces the dependency on fully labeled video datasets by utilizing a small set of labeled images from the target domain along with a large volume of general-domain, unlabeled images. MinMaxVIS operates in three stages: first, a preliminary segmentation model is trained on the small labeled set from the target domain; second, this model retrieves relevant instances from the unlabeled dataset to build a high-quality pseudo-labeled set, ensuring rich content alignment with the target domain while avoiding the inefficiency of large-scale semi-supervised learning over the entire unlabeled set; finally, we train MinMaxVIS on a combination of labeled and pseudo-labeled data, addressing challenges such as noise in pseudo-labels and instance association across frames. To simulate object continuity, we augment static images to create paired frames, allowing MinMaxVIS to capture instance associations effectively. MinMaxVIS outperforms the prior image-driven approach, MinVIS, achieving superior mAP scores with significantly reduced labeled data. For instance, MinMaxVIS with a Swin-L backbone attains 62.2 mAP on YouTube-VIS 2019 using only 2% labeled data and additional unlabeled images from SA-1B. This surpasses MinVIS, which uses the same backbone trained on fully labeled YouTube-VIS 2019, by 0.6 mAP.
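To make the paired-frame augmentation idea concrete, the following is a minimal sketch (not the authors' code) of how a single labeled image can be turned into a two-frame pseudo-clip whose instance correspondence is known by construction. It assumes PIL images with per-instance masks and torchvision-style transforms; all function and parameter names here are illustrative assumptions.

```python
# Minimal sketch: derive two jittered "frames" from one static image so that
# an image-trained model can be supervised on instance association.
# Both frames come from the same source image, so instance k in frame 1
# corresponds to instance k in frame 2 (identity mapping).
import random
import torchvision.transforms.functional as F

def make_pseudo_clip(image, masks, max_shift=0.05, max_scale=0.1):
    """image: PIL.Image; masks: list of PIL.Image binary masks, one per instance."""
    def jitter(img, mask_list):
        w, h = img.size
        dx = int(random.uniform(-max_shift, max_shift) * w)
        dy = int(random.uniform(-max_shift, max_shift) * h)
        scale = 1.0 + random.uniform(-max_scale, max_scale)
        # Apply the same affine transform to the image and all of its masks.
        img_t = F.affine(img, angle=0.0, translate=[dx, dy], scale=scale, shear=[0.0])
        masks_t = [F.affine(m, angle=0.0, translate=[dx, dy], scale=scale, shear=[0.0])
                   for m in mask_list]
        return img_t, masks_t

    frame1, masks1 = jitter(image, masks)
    frame2, masks2 = jitter(image, masks)
    # Ground-truth association across the two frames is the identity over indices.
    return (frame1, masks1), (frame2, masks2)
```

Any similar pair of photometric or geometric augmentations would serve the same purpose; the essential point is that identity labels are free because both frames originate from one image.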
Cite
Text
Wei et al. "Minimizing Labeled, Maximizing Unlabeled: An Image-Driven Approach for Video Instance Segmentation." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.01798
Markdown
[Wei et al. "Minimizing Labeled, Maximizing Unlabeled: An Image-Driven Approach for Video Instance Segmentation." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/wei2025cvpr-minimizing/) doi:10.1109/CVPR52734.2025.01798
BibTeX
@inproceedings{wei2025cvpr-minimizing,
title = {{Minimizing Labeled, Maximizing Unlabeled: An Image-Driven Approach for Video Instance Segmentation}},
author = {Wei, Fangyun and Zhao, Jinjing and Yan, Kun and Xu, Chang},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2025},
pages = {19304-19314},
doi = {10.1109/CVPR52734.2025.01798},
url = {https://mlanthology.org/cvpr/2025/wei2025cvpr-minimizing/}
}