URVOS: Unified Referring Video Object Segmentation Network with a Large-Scale Benchmark

Abstract

We propose a unified referring video object segmentation network (URVOS). URVOS takes a video and a referring expression as inputs and estimates the masks of the object referred to by the given language expression across all video frames. Our algorithm addresses this challenging problem by performing language-based object segmentation and mask propagation jointly in a single deep neural network, using a suitable combination of two attention models. In addition, we construct the first large-scale referring video object segmentation dataset, called Refer-Youtube-VOS. We evaluate our model on two benchmark datasets, including ours, and demonstrate the effectiveness of the proposed approach. The dataset is released at \url{https://github.com/skynbe/Refer-Youtube-VOS}.
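To make the abstract's design concrete, the sketch below pairs a cross-modal attention (grounding the referring expression in current-frame visual features) with a memory attention (attending over features of previously segmented frames for mask propagation), then fuses both to predict a mask. This is a minimal illustration under stated assumptions: the module names, dimensions, and fusion scheme are hypothetical and do not reproduce the authors' implementation.

```python
# Hypothetical sketch of the two-attention design described in the abstract.
# Cross-modal attention grounds the language expression in visual features;
# memory attention propagates mask information from previous frames.
import torch
import torch.nn as nn

class TwoAttentionSegHead(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # Assumed: generic multi-head attention stands in for both modules.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.memory_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)
        self.mask_head = nn.Conv2d(dim, 1, kernel_size=1)

    def forward(self, vis_feat, lang_feat, mem_feat):
        # vis_feat:  (B, C, H, W)  current-frame visual features
        # lang_feat: (B, L, C)     referring-expression token features
        # mem_feat:  (B, M, C)     features of previously segmented frames
        B, C, H, W = vis_feat.shape
        q = vis_feat.flatten(2).transpose(1, 2)                  # (B, H*W, C)
        lang_ctx, _ = self.cross_attn(q, lang_feat, lang_feat)   # language grounding
        mem_ctx, _ = self.memory_attn(q, mem_feat, mem_feat)     # mask propagation
        fused = self.fuse(torch.cat([lang_ctx, mem_ctx], dim=-1))
        fused = fused.transpose(1, 2).view(B, C, H, W)
        return self.mask_head(fused)                             # (B, 1, H, W) logits

# Usage with random tensors, just to show the expected shapes:
head = TwoAttentionSegHead(dim=256)
logits = head(torch.randn(2, 256, 16, 16),
              torch.randn(2, 10, 256),
              torch.randn(2, 64, 256))
print(logits.shape)  # torch.Size([2, 1, 16, 16])
```

Fusing the two attention outputs by concatenation and a linear projection is one simple choice; the key idea the abstract conveys is that language grounding and mask propagation are computed jointly within a single network rather than by separate models.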

Cite

Text

Seo et al. "URVOS: Unified Referring Video Object Segmentation Network with a Large-Scale Benchmark." Proceedings of the European Conference on Computer Vision (ECCV), 2020. doi:10.1007/978-3-030-58555-6_13

Markdown

[Seo et al. "URVOS: Unified Referring Video Object Segmentation Network with a Large-Scale Benchmark." Proceedings of the European Conference on Computer Vision (ECCV), 2020.](https://mlanthology.org/eccv/2020/seo2020eccv-urvos/) doi:10.1007/978-3-030-58555-6_13

BibTeX

@inproceedings{seo2020eccv-urvos,
  title     = {{URVOS: Unified Referring Video Object Segmentation Network with a Large-Scale Benchmark}},
  author    = {Seo, Seonguk and Lee, Joon-Young and Han, Bohyung},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2020},
  doi       = {10.1007/978-3-030-58555-6_13},
  url       = {https://mlanthology.org/eccv/2020/seo2020eccv-urvos/}
}