SLIC: Self-Supervised Learning with Iterative Clustering for Human Action Videos

Abstract

Self-supervised methods have significantly closed the gap with end-to-end supervised learning for image classification [13,24]. In the case of human action videos, however, where both appearance and motion are significant factors of variation, this gap remains significant [28,58]. One of the key reasons for this is that sampling pairs of similar video clips, a required step for many self-supervised contrastive learning methods, is currently done conservatively to avoid false positives. A typical assumption is that similar clips only occur temporally close within a single video, leading to insufficient examples of motion similarity. To mitigate this, we propose SLIC, a clustering-based self-supervised contrastive learning method for human action videos. Our key contribution is that we improve upon the traditional intra-video positive sampling by using iterative clustering to group similar video instances. This enables our method to leverage pseudo-labels from the cluster assignments to sample harder positives and negatives. SLIC outperforms state-of-the-art video retrieval baselines by +15.4% on top-1 recall on UCF101 and by +5.7% when directly transferred to HMDB51. With end-to-end finetuning for action classification, SLIC achieves 83.2% top-1 accuracy (+0.8%) on UCF101 and 54.5% on HMDB51 (+1.6%). SLIC is also competitive with the state-of-the-art in action classification after self-supervised pretraining on Kinetics400.
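The idea of using cluster assignments as pseudo-labels for positive and negative sampling can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `sample_triplet`, the inputs `cluster_ids` and `video_ids`, and the preference for positives from a different source video are all assumptions made for illustration. The clustering step itself is not shown, and the pseudo-labels are taken as given.

```python
import random


def sample_triplet(cluster_ids, video_ids, anchor, rng):
    """Return (anchor, positive, negative) clip indices, or None if impossible.

    Hypothetical helper, assuming cluster_ids[i] is the pseudo-label of clip i
    and video_ids[i] is the source video of clip i.
    """
    anchor_cluster = cluster_ids[anchor]
    # Candidate positives: other clips that share the anchor's pseudo-label.
    same = [i for i, c in enumerate(cluster_ids)
            if c == anchor_cluster and i != anchor]
    # Candidate negatives: clips assigned to any other cluster.
    diff = [i for i, c in enumerate(cluster_ids) if c != anchor_cluster]
    if not same or not diff:
        return None
    # Assumption: prefer positives from a different video, since those go
    # beyond the usual intra-video positive sampling.
    cross_video = [i for i in same if video_ids[i] != video_ids[anchor]]
    positive = rng.choice(cross_video or same)
    negative = rng.choice(diff)
    return anchor, positive, negative


# Toy example: five clips, two clusters, four source videos.
cluster_ids = [0, 0, 1, 1, 0]
video_ids = ["a", "a", "b", "c", "d"]
triplet = sample_triplet(cluster_ids, video_ids, anchor=0, rng=random.Random(0))
```

In the toy example, clip 4 is the only clip that shares the anchor's cluster but comes from a different video, so it is chosen as the positive, while the negative is drawn from clips 2 and 3.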

Cite

Text

Khorasgani et al. "SLIC: Self-Supervised Learning with Iterative Clustering for Human Action Videos." Conference on Computer Vision and Pattern Recognition, 2022. doi:10.1109/CVPR52688.2022.01562

Markdown

[Khorasgani et al. "SLIC: Self-Supervised Learning with Iterative Clustering for Human Action Videos." Conference on Computer Vision and Pattern Recognition, 2022.](https://mlanthology.org/cvpr/2022/khorasgani2022cvpr-slic/) doi:10.1109/CVPR52688.2022.01562

BibTeX

@inproceedings{khorasgani2022cvpr-slic,
  title     = {{SLIC: Self-Supervised Learning with Iterative Clustering for Human Action Videos}},
  author    = {Khorasgani, Salar Hosseini and Chen, Yuxuan and Shkurti, Florian},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2022},
  pages     = {16091--16101},
  doi       = {10.1109/CVPR52688.2022.01562},
  url       = {https://mlanthology.org/cvpr/2022/khorasgani2022cvpr-slic/}
}