Two-Shot Video Object Segmentation

Yan, Kun; Li, Xiao; Wei, Fangyun; Wang, Jinglu; Zhang, Chenbin; Wang, Ping; Lu, Yan

doi:10.1109/CVPR52729.2023.00224

CVPR 2023 pp. 2257-2267

Abstract

Previous works on video object segmentation (VOS) are trained on densely annotated videos. However, acquiring pixel-level annotations is expensive and time-consuming. In this work, we demonstrate the feasibility of training a satisfactory VOS model on sparsely annotated videos: we require only two labeled frames per training video while maintaining performance. We term this novel training paradigm two-shot video object segmentation, or two-shot VOS for short. The underlying idea is to generate pseudo labels for unlabeled frames during training and to optimize the model on the combination of labeled and pseudo-labeled data. Our approach is extremely simple and can be applied to a majority of existing frameworks. We first pre-train a VOS model on sparsely annotated videos in a semi-supervised manner, with the first frame always being a labeled one. Then, we use the pre-trained VOS model to generate pseudo labels for all unlabeled frames, which are stored in a pseudo-label bank. Finally, we retrain a VOS model on both labeled and pseudo-labeled data without any restrictions on the first frame. For the first time, we present a general way to train VOS models on two-shot VOS datasets. Using only 7.3% and 2.9% of the labeled data of the YouTube-VOS and DAVIS benchmarks, respectively, our approach achieves results comparable to those of counterparts trained on the fully labeled sets. Code and models are available at https://github.com/yk-pku/Two-shot-Video-Object-Segmentation.
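
The data flow of the three stages in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `sample_clip`, `build_pseudo_label_bank`, and `label_for` are hypothetical, masks are stand-in Python lists, the `predict` callable stands in for the pre-trained VOS model, and sampling clips in temporal order is an assumption made for simplicity.

```python
import random
from typing import Callable, Dict, List

Mask = List[int]  # stand-in for a per-pixel mask (assumption for this sketch)


def sample_clip(num_frames: int, labeled: List[int], clip_len: int,
                first_labeled: bool, rng: random.Random) -> List[int]:
    """Pick `clip_len` distinct frame indices in temporal order."""
    if first_labeled:
        # Stage 1: the clip must start at a ground-truth frame.
        # Keep only labeled frames with enough later frames to fill the clip.
        candidates = [i for i in labeled if num_frames - i >= clip_len]
        start = rng.choice(candidates)
        rest = rng.sample(range(start + 1, num_frames), clip_len - 1)
        return [start] + sorted(rest)
    # Stage 3: no restriction on which frame comes first.
    return sorted(rng.sample(range(num_frames), clip_len))


def build_pseudo_label_bank(frames: List[object],
                            ground_truth: Dict[int, Mask],
                            predict: Callable[[object], Mask]) -> Dict[int, Mask]:
    """Stage 2: store one predicted mask per unlabeled frame."""
    bank: Dict[int, Mask] = {}
    for idx, frame in enumerate(frames):
        if idx not in ground_truth:
            bank[idx] = predict(frame)
    return bank


def label_for(idx: int, ground_truth: Dict[int, Mask],
              bank: Dict[int, Mask]) -> Mask:
    """Prefer the ground-truth mask; fall back to the pseudo-label bank."""
    return ground_truth[idx] if idx in ground_truth else bank[idx]
```

In this sketch, a stage-1 training loop would call `sample_clip(..., first_labeled=True, ...)`, while the retraining stage would call it with `first_labeled=False` and read every target through `label_for`, so that labeled and pseudo-labeled frames are handled uniformly.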

Cite

Text

Yan et al. "Two-Shot Video Object Segmentation." Conference on Computer Vision and Pattern Recognition, 2023. doi:10.1109/CVPR52729.2023.00224

Markdown

[Yan et al. "Two-Shot Video Object Segmentation." Conference on Computer Vision and Pattern Recognition, 2023.](https://mlanthology.org/cvpr/2023/yan2023cvpr-twoshot/) doi:10.1109/CVPR52729.2023.00224

BibTeX

@inproceedings{yan2023cvpr-twoshot,
  title     = {{Two-Shot Video Object Segmentation}},
  author    = {Yan, Kun and Li, Xiao and Wei, Fangyun and Wang, Jinglu and Zhang, Chenbin and Wang, Ping and Lu, Yan},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2023},
  pages     = {2257-2267},
  doi       = {10.1109/CVPR52729.2023.00224},
  url       = {https://mlanthology.org/cvpr/2023/yan2023cvpr-twoshot/}
}