Self-Supervised Video Object Segmentation by Motion Grouping

Abstract

Animals have evolved highly functional visual systems for understanding motion, which assists perception even in complex environments. In this paper, we work towards developing a computer vision system able to segment objects by exploiting motion cues, i.e. motion segmentation. To achieve this, we introduce a simple variant of the Transformer that segments optical flow frames into primary objects and the background, and can be trained in a self-supervised manner, i.e. without any manual annotations. Despite using only optical flow as input, with no appearance information, our approach achieves superior results compared to previous state-of-the-art self-supervised methods on public benchmarks (DAVIS2016, SegTrackv2, FBMS59), while being an order of magnitude faster. On a challenging camouflage dataset (MoCA), we significantly outperform other self-supervised approaches and are competitive with the top supervised approach, highlighting the importance of motion cues and the potential bias towards appearance in existing video segmentation models.
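The abstract does not spell out the architecture, but the behavior it describes, grouping optical-flow pixels into an object layer and a background layer with a Transformer-like attention module and training without labels, can be sketched with a slot-attention-style grouping module. The sketch below is a minimal illustration under those assumptions, not the authors' implementation; the `FlowGrouping` class, the slot count, and all dimensions are hypothetical.

```python
# Minimal sketch (assumed, not the paper's code): K learnable slots compete
# for optical-flow features via iterative attention; the per-pixel softmax
# over slots serves as a soft segmentation mask.
import torch
import torch.nn as nn

class FlowGrouping(nn.Module):
    def __init__(self, dim=64, num_slots=2, iters=3):
        super().__init__()
        self.iters = iters
        # One slot per layer, e.g. primary object vs. background.
        self.slots_init = nn.Parameter(torch.randn(num_slots, dim))
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.gru = nn.GRUCell(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, feats):
        # feats: (B, N, dim) flow features, N = H * W flattened pixels.
        B, N, D = feats.shape
        slots = self.slots_init.expand(B, -1, -1).contiguous()
        k, v = self.to_k(feats), self.to_v(feats)
        for _ in range(self.iters):
            q = self.to_q(slots)
            logits = torch.einsum('bkd,bnd->bkn', q, k) * self.scale
            masks = logits.softmax(dim=1)        # slots compete for each pixel
            attn = masks / masks.sum(dim=-1, keepdim=True).clamp(min=1e-8)
            updates = torch.einsum('bkn,bnd->bkd', attn, v)
            slots = self.gru(updates.reshape(-1, D),
                             slots.reshape(-1, D)).view(B, -1, D)
        return slots, masks                      # masks: (B, K, N), sum to 1 per pixel

# Example: 4 frames of 32x32 flow features -> 2 soft masks per frame.
feats = torch.randn(4, 32 * 32, 64)
slots, masks = FlowGrouping()(feats)
```

A self-supervised training signal can then come from reconstruction: decode each slot back to a flow map, composite the maps with the attention masks, and minimize the error against the input flow, so no manual annotations are needed. At test time, thresholding the per-pixel mask of the object slot yields the segmentation.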

Cite

Text

Yang et al. "Self-Supervised Video Object Segmentation by Motion Grouping." International Conference on Computer Vision, 2021. doi:10.1109/ICCV48922.2021.00709

Markdown

[Yang et al. "Self-Supervised Video Object Segmentation by Motion Grouping." International Conference on Computer Vision, 2021.](https://mlanthology.org/iccv/2021/yang2021iccv-selfsupervised/) doi:10.1109/ICCV48922.2021.00709

BibTeX

@inproceedings{yang2021iccv-selfsupervised,
  title     = {{Self-Supervised Video Object Segmentation by Motion Grouping}},
  author    = {Yang, Charig and Lamdouar, Hala and Lu, Erika and Zisserman, Andrew and Xie, Weidi},
  booktitle = {International Conference on Computer Vision},
  year      = {2021},
  pages     = {7177--7188},
  doi       = {10.1109/ICCV48922.2021.00709},
  url       = {https://mlanthology.org/iccv/2021/yang2021iccv-selfsupervised/}
}