TransRank: Self-Supervised Video Representation Learning via Ranking-Based Transformation Recognition

Abstract

Recognizing transformation types applied to a video clip (RecogTrans) is a long-established paradigm for self-supervised video representation learning, which achieves much inferior performance compared to instance discrimination approaches (InstDisc) in recent works. However, based on a thorough comparison of representative RecogTrans and InstDisc methods, we observe the great potential of RecogTrans on both semantic-related and temporal-related downstream tasks. Because they rely on hard-label classification, existing RecogTrans approaches suffer from noisy supervision signals in pre-training. To mitigate this problem, we develop TransRank, a unified framework for recognizing Transformations in a Ranking formulation. TransRank provides accurate supervision signals by recognizing transformations relatively, and it consistently outperforms the classification-based formulation. Meanwhile, the unified framework can be instantiated with an arbitrary set of temporal or spatial transformations, demonstrating good generality. With a ranking-based formulation and several empirical practices, we achieve competitive performance on video retrieval and action recognition. Under the same setting, TransRank surpasses the previous state-of-the-art method by 6.4% on UCF101 and 8.3% on HMDB51 for action recognition (Top-1 accuracy), and improves video retrieval on UCF101 by 20.4% (R@1). The promising results validate that RecogTrans is still a paradigm worth exploring for video self-supervised learning. Code will be released at https://github.com/kennymckormick/TransRank.
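
The contrast between hard-label classification and relative recognition can be illustrated with a small sketch. The abstract does not specify the loss, so everything below is an assumption made for illustration and not the authors' implementation: the function names, the hinge form, the `margin` value, and the toy scores are all hypothetical. The sketch assumes that clips derived from the same source video each receive a different transformation, and that `scores[i][k]` is the model's score for transformation `k` on clip `i`, where clip `i` truly carries transformation `i`.

```python
import math


def cross_entropy(logits, label):
    """Hard-label softmax cross-entropy for a single clip."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[label]


def classification_loss(scores):
    """Mean cross-entropy; each clip is judged on its own (absolute) scores."""
    return sum(cross_entropy(row, i) for i, row in enumerate(scores)) / len(scores)


def ranking_loss(scores, margin=1.0):
    """Hypothetical hinge ranking loss across clips of the same video.

    For each transformation k, the clip that truly has transformation k
    should outscore every other clip on column k by at least `margin`.
    """
    n = len(scores)
    total, pairs = 0.0, 0
    for k in range(n):
        for i in range(n):
            if i != k:
                total += max(0.0, margin - (scores[k][k] - scores[i][k]))
                pairs += 1
    return total / pairs


# Toy scores for two clips of one video. Both clips lean toward
# transformation 0, as might happen when the video content itself biases
# the prediction.
scores = [
    [3.0, 0.0],  # clip 0, true transformation 0
    [2.0, 1.0],  # clip 1, true transformation 1
]

print("classification loss:", classification_loss(scores))
print("ranking loss:", ranking_loss(scores))
```

In this toy case the hard-label loss still penalizes clip 1, whose highest score is on the wrong class, while the relative comparison is already satisfied: clip 0 outscores clip 1 on transformation 0, and clip 1 outscores clip 0 on transformation 1, each by the assumed margin.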

Cite

Text

Duan et al. "TransRank: Self-Supervised Video Representation Learning via Ranking-Based Transformation Recognition." Conference on Computer Vision and Pattern Recognition, 2022. doi:10.1109/CVPR52688.2022.00301

Markdown

[Duan et al. "TransRank: Self-Supervised Video Representation Learning via Ranking-Based Transformation Recognition." Conference on Computer Vision and Pattern Recognition, 2022.](https://mlanthology.org/cvpr/2022/duan2022cvpr-transrank/) doi:10.1109/CVPR52688.2022.00301

BibTeX

@inproceedings{duan2022cvpr-transrank,
  title     = {{TransRank: Self-Supervised Video Representation Learning via Ranking-Based Transformation Recognition}},
  author    = {Duan, Haodong and Zhao, Nanxuan and Chen, Kai and Lin, Dahua},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2022},
  pages     = {3000-3010},
  doi       = {10.1109/CVPR52688.2022.00301},
  url       = {https://mlanthology.org/cvpr/2022/duan2022cvpr-transrank/}
}