Learning What to Learn for Video Object Segmentation

Abstract

Video object segmentation (VOS) is a highly challenging problem, since the target object is only defined by a first-frame reference mask during inference. The problem of how to capture and utilize this limited information to accurately segment the target remains a fundamental research question. We address this by introducing an end-to-end trainable VOS architecture that integrates a differentiable few-shot learner. Our learner is designed to predict a powerful parametric model of the target by minimizing a segmentation error in the first frame. We further go beyond the standard few-shot learning paradigm by learning what our target model should learn in order to maximize segmentation accuracy. We perform extensive experiments on standard benchmarks. Our approach sets a new state-of-the-art on the large-scale YouTube-VOS 2018 dataset by achieving an overall score of 81.5, corresponding to a 2.6% relative improvement over the previous best result. The code and models are available at https://github.com/visionml/pytracking.
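To make the idea of a learner that fits a target model by minimizing a first-frame segmentation error more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a linear target model `T`, a weighted squared error, and a few unrolled steepest-descent steps with an exact step length. The names `few_shot_learner`, `loss`, `X`, `Y`, `w`, and `lam` are hypothetical. In the abstract's terms, `Y` (the label encoding of the mask) and `w` (per-location importance weights) stand in for "what the target model should learn"; here they are plain inputs rather than outputs of trained networks.

```python
import numpy as np


def loss(T, X, Y, w, lam):
    """Weighted squared segmentation error plus L2 regularization."""
    residual = w * (X @ T - Y)
    return 0.5 * np.sum(residual ** 2) + 0.5 * lam * np.sum(T ** 2)


def few_shot_learner(X, Y, w, lam=1e-2, num_iter=5):
    """Fit a linear target model with a few unrolled steepest-descent steps.

    X: (N, D) first-frame features, one row per spatial location (assumed).
    Y: (N, C) label encoding of the reference mask (assumed given here).
    w: (N, 1) per-location importance weights (assumed given here).
    """
    T = np.zeros((X.shape[1], Y.shape[1]))
    for _ in range(num_iter):
        # Gradient of the quadratic loss with respect to T.
        grad = X.T @ (w ** 2 * (X @ T - Y)) + lam * T
        # Exact line search: the loss is quadratic along the gradient direction.
        curvature = np.sum((w * (X @ grad)) ** 2) + lam * np.sum(grad ** 2)
        if curvature <= 1e-12:
            break
        step = np.sum(grad ** 2) / curvature
        T = T - step * grad
    return T


# Toy usage with random data standing in for features and mask encodings.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
Y = rng.normal(size=(50, 2))
w = np.ones((50, 1))
T = few_shot_learner(X, Y, w)
```

Every step consists of differentiable array operations, so in a framework with automatic differentiation the unrolled iterations could be backpropagated through. This is what would allow such a learner to sit inside an end-to-end trainable architecture.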

Cite

Text

Bhat et al. "Learning What to Learn for Video Object Segmentation." Proceedings of the European Conference on Computer Vision (ECCV), 2020. doi:10.1007/978-3-030-58536-5_46

Markdown

[Bhat et al. "Learning What to Learn for Video Object Segmentation." Proceedings of the European Conference on Computer Vision (ECCV), 2020.](https://mlanthology.org/eccv/2020/bhat2020eccv-learning/) doi:10.1007/978-3-030-58536-5_46

BibTeX

@inproceedings{bhat2020eccv-learning,
  title     = {{Learning What to Learn for Video Object Segmentation}},
  author    = {Bhat, Goutam and Lawin, Felix Järemo and Danelljan, Martin and Robinson, Andreas and Felsberg, Michael and Van Gool, Luc and Timofte, Radu},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2020},
  doi       = {10.1007/978-3-030-58536-5_46},
  url       = {https://mlanthology.org/eccv/2020/bhat2020eccv-learning/}
}