GOCA: Guided Online Cluster Assignment for Self-Supervised Video Representation Learning

Abstract

Clustering is a ubiquitous tool in unsupervised learning. Most existing self-supervised representation learning methods cluster samples based on visually dominant features. While this works well for image-based self-supervision, it often fails for videos, which require understanding motion rather than focusing on background. Using optical flow as complementary information to RGB can alleviate this problem. However, we observe that a naïve combination of the two modalities does not provide meaningful gains. In this paper, we propose a principled way to combine the two modalities. Specifically, we propose a novel clustering strategy in which the initial cluster assignment of each modality serves as a prior to guide the final cluster assignment of the other modality. This idea enforces similar cluster structures for both modalities, and the resulting clusters are semantically abstract and robust to noisy inputs coming from each individual modality. Additionally, we propose a novel regularization strategy to address the feature collapse problem, which is common in cluster-based self-supervised learning methods. Our extensive evaluation shows the effectiveness of our learned representations on downstream tasks, e.g., video retrieval and action recognition. Specifically, we outperform the state of the art by 7% on UCF and 4% on HMDB for video retrieval, as well as 5% on UCF and 6% on HMDB for linear video classification.
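As a rough illustration of the cross-modal guidance idea described above, the toy sketch below treats one modality's initial soft cluster assignment as a prior on the other's. It is not the paper's algorithm: the combination rule (an element-wise product of softmax assignments, renormalized), the function names `softmax` and `guided_assignment`, and the example scores are all assumptions made here for illustration only.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def guided_assignment(own_scores, other_scores):
    """Combine one modality's cluster scores with a prior from the other.

    Hypothetical rule (an assumption, not taken from the paper):
    final is proportional to softmax(own_scores) * softmax(other_scores).
    """
    own = softmax(own_scores)        # initial assignment of this modality
    prior = softmax(other_scores)    # initial assignment of the other modality
    joint = [o * p for o, p in zip(own, prior)]
    total = sum(joint)
    return [j / total for j in joint]

# Made-up scores for one clip over three clusters.
# RGB weakly prefers cluster 0 (say, a background cue),
# while optical flow strongly prefers cluster 1 (a motion cue).
rgb_scores = [1.0, 0.8, 0.1]
flow_scores = [0.2, 2.0, 0.1]

rgb_final = guided_assignment(rgb_scores, flow_scores)
flow_final = guided_assignment(flow_scores, rgb_scores)

print("RGB initial :", softmax(rgb_scores))
print("RGB guided  :", rgb_final)
print("Flow guided :", flow_final)
```

Under this simplified rule the two guided assignments coincide, which is a cartoon version of enforcing similar cluster structure across modalities; the method in the paper is an online clustering procedure and should be consulted for the actual formulation.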

Cite

Text

Coskun et al. "GOCA: Guided Online Cluster Assignment for Self-Supervised Video Representation Learning." Proceedings of the European Conference on Computer Vision (ECCV), 2022. doi:10.1007/978-3-031-19821-2_1

Markdown

[Coskun et al. "GOCA: Guided Online Cluster Assignment for Self-Supervised Video Representation Learning." Proceedings of the European Conference on Computer Vision (ECCV), 2022.](https://mlanthology.org/eccv/2022/coskun2022eccv-goca/) doi:10.1007/978-3-031-19821-2_1

BibTeX

@inproceedings{coskun2022eccv-goca,
  title     = {{GOCA: Guided Online Cluster Assignment for Self-Supervised Video Representation Learning}},
  author    = {Coskun, Huseyin and Zareian, Alireza and Moore, Joshua L. and Tombari, Federico and Wang, Chen},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2022},
  doi       = {10.1007/978-3-031-19821-2_1},
  url       = {https://mlanthology.org/eccv/2022/coskun2022eccv-goca/}
}