Long-Short Temporal Contrastive Learning of Video Transformers

Abstract

Video transformers have recently emerged as a competitive alternative to 3D CNNs for video understanding. However, due to their large number of parameters and reduced inductive biases, these models require supervised pretraining on large-scale image datasets to achieve top performance. In this paper, we empirically demonstrate that self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results that are on par with, or better than, those obtained with supervised pretraining on large-scale image datasets, even massive ones such as ImageNet-21K. Since transformer-based models are effective at capturing dependencies over extended temporal spans, we propose a simple learning procedure that forces the model to match a long-term view to a short-term view of the same video. Our approach, named Long-Short Temporal Contrastive Learning (LSTCL), enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent. To demonstrate the generality of our findings, we implement and validate our approach under three different self-supervised contrastive learning frameworks (MoCo v3, BYOL, SimSiam) using two distinct video-transformer architectures, including an improved variant of the Swin Transformer augmented with space-time attention. We conduct a thorough ablation study and show that LSTCL achieves competitive performance on multiple video benchmarks and represents a convincing alternative to supervised image-based pretraining.
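To make the learning procedure concrete, below is a minimal PyTorch sketch of the long-short view matching described in the abstract, written in the MoCo-style InfoNCE variant (the paper also validates BYOL and SimSiam instantiations). This is not the authors' released code: the clip lengths, temperature, and toy linear encoders are illustrative placeholders, and in LSTCL proper a single video transformer (with a momentum copy as the key encoder) processes clips of either length.

```python
import torch
import torch.nn.functional as F

def sample_long_short_views(video, short_len=8, long_len=32):
    """Sample a short clip and a longer clip from the same video.

    `video` is a (C, T, H, W) tensor with T >= long_len; the clip
    lengths here are illustrative, not the paper's exact settings.
    """
    T = video.shape[1]
    long_start = torch.randint(0, T - long_len + 1, (1,)).item()
    long_view = video[:, long_start:long_start + long_len]
    # The short view is drawn independently, so matching it to the
    # long view forces the model to predict broader temporal context.
    short_start = torch.randint(0, T - short_len + 1, (1,)).item()
    short_view = video[:, short_start:short_start + short_len]
    return short_view, long_view

def info_nce(q, k, temperature=0.1):
    """InfoNCE over in-batch negatives: the i-th short-view embedding
    should match the i-th long-view embedding and repel all others."""
    q = F.normalize(q, dim=1)
    k = F.normalize(k, dim=1)
    logits = q @ k.t() / temperature            # (B, B) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

if __name__ == "__main__":
    # Toy demonstration with random data and linear "encoders"; in the
    # paper the encoder is a video transformer (two separate linear maps
    # are used here only because the flattened input sizes differ).
    B, C, T, H, W = 4, 3, 64, 32, 32
    short_len, long_len, dim = 8, 32, 128
    videos = torch.randn(B, C, T, H, W)
    enc_short = torch.nn.Linear(C * short_len * H * W, dim)
    enc_long = torch.nn.Linear(C * long_len * H * W, dim)
    shorts, longs = zip(*(sample_long_short_views(v, short_len, long_len)
                          for v in videos))
    q = enc_short(torch.stack(shorts).flatten(1))   # queries: short clips
    with torch.no_grad():                           # stop-gradient on keys,
        k = enc_long(torch.stack(longs).flatten(1)) # as in momentum encoders
    print(info_nce(q, k).item())
```

The asymmetry is the key design choice: gradients flow only through the short-view (query) branch, so the clip-level representation is trained to predict context it cannot directly observe in the long view.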

Cite

Text

Wang et al. "Long-Short Temporal Contrastive Learning of Video Transformers." Conference on Computer Vision and Pattern Recognition, 2022. doi:10.1109/CVPR52688.2022.01362

Markdown

[Wang et al. "Long-Short Temporal Contrastive Learning of Video Transformers." Conference on Computer Vision and Pattern Recognition, 2022.](https://mlanthology.org/cvpr/2022/wang2022cvpr-longshort/) doi:10.1109/CVPR52688.2022.01362

BibTeX

@inproceedings{wang2022cvpr-longshort,
  title     = {{Long-Short Temporal Contrastive Learning of Video Transformers}},
  author    = {Wang, Jue and Bertasius, Gedas and Tran, Du and Torresani, Lorenzo},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2022},
  pages     = {14010--14020},
  doi       = {10.1109/CVPR52688.2022.01362},
  url       = {https://mlanthology.org/cvpr/2022/wang2022cvpr-longshort/}
}