MobileVOS: Real-Time Video Object Segmentation Contrastive Learning Meets Knowledge Distillation

Abstract

This paper tackles the problem of semi-supervised video object segmentation on resource-constrained devices, such as mobile phones. We formulate this problem as a distillation task, whereby we demonstrate that small space-time-memory networks with finite memory can achieve results competitive with the state of the art, but at a fraction of the computational cost (32 milliseconds per frame on a Samsung Galaxy S22). Specifically, we provide a theoretically grounded framework that unifies knowledge distillation with supervised contrastive representation learning. These models are able to jointly benefit from both pixel-wise contrastive learning and distillation from a pre-trained teacher. We validate this loss by achieving J&F scores competitive with the state of the art on both the standard DAVIS and YouTube benchmarks, despite running up to 5× faster and with 32× fewer parameters.
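The abstract's core idea, a single objective that combines per-pixel distillation from a teacher with supervised contrastive learning over pixel embeddings, can be illustrated with a minimal PyTorch sketch. This is not the authors' released implementation: the function name, tensor shapes, pixel subsampling, temperatures, and the weighting `alpha` are all illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of a joint loss combining pixel-wise
# knowledge distillation with supervised contrastive learning.
import torch
import torch.nn.functional as F


def joint_distill_contrastive_loss(student_logits, teacher_logits, labels,
                                   student_feats, temperature=0.1, alpha=0.5):
    """student_logits / teacher_logits: (B, C, H, W) segmentation logits.
    labels: (B, H, W) integer object labels.
    student_feats: (B, D, H, W) per-pixel embeddings from the student.
    All hyper-parameter values here are assumed, not taken from the paper.
    """
    # 1) Logit distillation: softened KL divergence to the teacher, per pixel.
    t = 2.0  # softening temperature (assumed)
    kd = F.kl_div(
        F.log_softmax(student_logits / t, dim=1),
        F.softmax(teacher_logits / t, dim=1),
        reduction="batchmean",
    ) * (t * t)

    # 2) Pixel-wise supervised contrastive term on a random subset of pixels.
    B, D, H, W = student_feats.shape
    feats = F.normalize(student_feats.permute(0, 2, 3, 1).reshape(-1, D), dim=1)
    lbls = labels.reshape(-1)
    idx = torch.randperm(feats.size(0))[:1024]        # subsample for tractability
    feats, lbls = feats[idx], lbls[idx]

    sim = feats @ feats.t() / temperature             # pairwise cosine similarities
    mask = (lbls[:, None] == lbls[None, :]).float()   # positives share a label
    mask.fill_diagonal_(0)

    logits = sim - sim.max(dim=1, keepdim=True).values.detach()
    exp_logits = torch.exp(logits) * (1 - torch.eye(len(lbls), device=feats.device))
    log_prob = logits - torch.log(exp_logits.sum(dim=1, keepdim=True) + 1e-8)

    pos_per_anchor = mask.sum(dim=1).clamp(min=1)
    contrastive = (-(mask * log_prob).sum(dim=1) / pos_per_anchor).mean()

    # 3) Weighted combination of the distillation and contrastive terms.
    return alpha * kd + (1 - alpha) * contrastive
```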

Cite

Text

Miles et al. "MobileVOS: Real-Time Video Object Segmentation Contrastive Learning Meets Knowledge Distillation." Conference on Computer Vision and Pattern Recognition, 2023. doi:10.1109/CVPR52729.2023.01010

Markdown

[Miles et al. "MobileVOS: Real-Time Video Object Segmentation Contrastive Learning Meets Knowledge Distillation." Conference on Computer Vision and Pattern Recognition, 2023.](https://mlanthology.org/cvpr/2023/miles2023cvpr-mobilevos/) doi:10.1109/CVPR52729.2023.01010

BibTeX

@inproceedings{miles2023cvpr-mobilevos,
  title     = {{MobileVOS: Real-Time Video Object Segmentation Contrastive Learning Meets Knowledge Distillation}},
  author    = {Miles, Roy and Yucel, Mehmet Kerim and Manganelli, Bruno and Saà-Garriga, Albert},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2023},
  pages     = {10480--10490},
  doi       = {10.1109/CVPR52729.2023.01010},
  url       = {https://mlanthology.org/cvpr/2023/miles2023cvpr-mobilevos/}
}