MobileVOS: Real-Time Video Object Segmentation Contrastive Learning Meets Knowledge Distillation
Abstract
This paper tackles the problem of semi-supervised video object segmentation on resource-constrained devices, such as mobile phones. We formulate this problem as a distillation task, whereby we demonstrate that small space-time-memory networks with finite memory can achieve results competitive with the state of the art, but at a fraction of the computational cost (32 milliseconds per frame on a Samsung Galaxy S22). Specifically, we provide a theoretically grounded framework that unifies knowledge distillation with supervised contrastive representation learning. These models are able to jointly benefit from both pixel-wise contrastive learning and distillation from a pre-trained teacher. We validate this loss by achieving J&F competitive with the state of the art on both the standard DAVIS and YouTube benchmarks, despite running up to 5× faster and with 32× fewer parameters.
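The unified loss is not spelled out on this page, so the snippet below is only a minimal, illustrative PyTorch-style sketch of the general idea: a pixel-wise supervised contrastive term combined with a distillation term that matches the student's pixel-similarity structure to a frozen teacher's. The function name unified_contrastive_distillation_loss and the parameters temperature and alpha are placeholders chosen for illustration, not the authors' actual implementation.

import torch
import torch.nn.functional as F

def unified_contrastive_distillation_loss(student_feats, teacher_feats, labels,
                                           temperature=0.1, alpha=0.5):
    # student_feats, teacher_feats: (N, C) per-pixel embeddings
    # labels: (N,) integer object ids per pixel
    # NOTE: illustrative sketch only; names and weighting are assumptions.
    s = F.normalize(student_feats, dim=1)
    t = F.normalize(teacher_feats, dim=1).detach()  # teacher is frozen

    # Supervised contrastive term over student pixel embeddings (SupCon-style).
    logits = s @ s.t() / temperature                              # (N, N)
    eye = torch.eye(len(labels), dtype=torch.bool, device=labels.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye
    log_prob = logits - torch.logsumexp(
        logits.masked_fill(eye, float('-inf')), dim=1, keepdim=True)
    contrastive = -(log_prob * pos_mask).sum(1) / pos_mask.sum(1).clamp(min=1)

    # Distillation term: align student and teacher pixel-similarity matrices.
    kd = F.mse_loss(s @ s.t(), t @ t.t())

    return alpha * contrastive.mean() + (1 - alpha) * kd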
Cite
Text
Miles et al. "MobileVOS: Real-Time Video Object Segmentation Contrastive Learning Meets Knowledge Distillation." Conference on Computer Vision and Pattern Recognition, 2023. doi:10.1109/CVPR52729.2023.01010
Markdown
[Miles et al. "MobileVOS: Real-Time Video Object Segmentation Contrastive Learning Meets Knowledge Distillation." Conference on Computer Vision and Pattern Recognition, 2023.](https://mlanthology.org/cvpr/2023/miles2023cvpr-mobilevos/) doi:10.1109/CVPR52729.2023.01010
BibTeX
@inproceedings{miles2023cvpr-mobilevos,
title = {{MobileVOS: Real-Time Video Object Segmentation Contrastive Learning Meets Knowledge Distillation}},
author = {Miles, Roy and Yucel, Mehmet Kerim and Manganelli, Bruno and Saà-Garriga, Albert},
booktitle = {Conference on Computer Vision and Pattern Recognition},
year = {2023},
pages = {10480--10490},
doi = {10.1109/CVPR52729.2023.01010},
url = {https://mlanthology.org/cvpr/2023/miles2023cvpr-mobilevos/}
}