Two-Shot Video Object Segmentation
Abstract
Previous work on video object segmentation (VOS) trains on densely annotated videos. However, acquiring pixel-level annotations is expensive and time-consuming. In this work, we demonstrate the feasibility of training a satisfactory VOS model on sparsely annotated videos: we require only two labeled frames per training video while performance is sustained. We term this novel training paradigm two-shot video object segmentation, or two-shot VOS for short. The underlying idea is to generate pseudo labels for unlabeled frames during training and to optimize the model on the combination of labeled and pseudo-labeled data. Our approach is extremely simple and can be applied to a majority of existing frameworks. We first pre-train a VOS model on sparsely annotated videos in a semi-supervised manner, with the first frame always being a labeled one. Then, we adopt the pre-trained VOS model to generate pseudo labels for all unlabeled frames, which are subsequently stored in a pseudo-label bank. Finally, we retrain a VOS model on both labeled and pseudo-labeled data without any restrictions on the first frame. For the first time, we present a general way to train VOS models on two-shot VOS datasets. Using 7.3% and 2.9% of the labeled data of the YouTube-VOS and DAVIS benchmarks, respectively, our approach achieves results comparable to those of counterparts trained on the fully labeled sets. Code and models are available at https://github.com/yk-pku/Two-shot-Video-Object-Segmentation.
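The three stages described in the abstract can be sketched roughly as follows. This is a minimal illustration of the data flow only, not the authors' implementation: the function names (`sample_clip`, `build_pseudo_label_bank`, `get_mask`), the dictionary layout of `videos`, the clip length, and the `predict_mask` callback standing in for the pre-trained model are all hypothetical, and the semi-supervised loss used during pre-training is omitted.

```python
import random


def sample_clip(num_frames, labeled, clip_len=3, first_must_be_labeled=True, rng=random):
    """Pick clip_len frame indices in temporal order (hypothetical helper).

    first_must_be_labeled=True  mimics the pre-training constraint
                                (clip starts on a ground-truth frame).
    first_must_be_labeled=False mimics retraining, where any frame may start
                                the clip because every frame now has a mask.
    """
    starts = sorted(labeled) if first_must_be_labeled else list(range(num_frames))
    # Keep only start frames that leave room for the rest of the clip.
    starts = [s for s in starts if s <= num_frames - clip_len]
    if not starts:
        raise ValueError("no valid start frame for this clip length")
    first = rng.choice(starts)
    rest = sorted(rng.sample(range(first + 1, num_frames), clip_len - 1))
    return [first] + rest


def build_pseudo_label_bank(videos, predict_mask):
    """Store one predicted mask per unlabeled frame (assumed bank layout)."""
    bank = {}
    for video_id, video in videos.items():
        for t in range(video["num_frames"]):
            if t not in video["labels"]:
                # predict_mask stands in for the pre-trained VOS model.
                bank[(video_id, t)] = predict_mask(video_id, t)
    return bank


def get_mask(video_id, t, videos, bank):
    """Prefer the ground-truth mask; fall back to the pseudo-label bank."""
    labels = videos[video_id]["labels"]
    return labels[t] if t in labels else bank[(video_id, t)]


# Toy usage: one 6-frame video with two labeled frames (strings stand in for masks).
videos = {"v": {"num_frames": 6, "labels": {0: "gt0", 3: "gt3"}}}
pretrain_clip = sample_clip(6, videos["v"]["labels"], first_must_be_labeled=True)
bank = build_pseudo_label_bank(videos, lambda video_id, t: f"pseudo{t}")
retrain_clip = sample_clip(6, videos["v"]["labels"], first_must_be_labeled=False)
retrain_masks = [get_mask("v", t, videos, bank) for t in retrain_clip]
```

In this sketch the only difference between the pre-training and retraining samplers is the set of allowed start frames, which mirrors the abstract's statement that the first-frame restriction is lifted once the pseudo-label bank exists.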
Cite
Text
Yan et al. "Two-Shot Video Object Segmentation." Conference on Computer Vision and Pattern Recognition, 2023. doi:10.1109/CVPR52729.2023.00224
Markdown
[Yan et al. "Two-Shot Video Object Segmentation." Conference on Computer Vision and Pattern Recognition, 2023.](https://mlanthology.org/cvpr/2023/yan2023cvpr-twoshot/) doi:10.1109/CVPR52729.2023.00224
BibTeX
@inproceedings{yan2023cvpr-twoshot,
title = {{Two-Shot Video Object Segmentation}},
author = {Yan, Kun and Li, Xiao and Wei, Fangyun and Wang, Jinglu and Zhang, Chenbin and Wang, Ping and Lu, Yan},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2023},
pages = {2257-2267},
doi = {10.1109/CVPR52729.2023.00224},
url = {https://mlanthology.org/cvpr/2023/yan2023cvpr-twoshot/}
}