ARVideo: Autoregressive Pretraining for Self-Supervised Video Representation Learning

Abstract

This paper presents \textbf{ARVideo}, a new self-supervised video representation learning framework that \textit{autoregressively} predicts the next video token in a tailored sequence order. It includes two key designs. First, we organize autoregressive video tokens into clusters that span both \textit{spatially} and \textit{temporally}, thereby enabling a richer aggregation of contextual information than standard spatial-only or temporal-only clusters. Second, we adopt a randomized spatiotemporal prediction order to facilitate learning from multi-dimensional data, addressing the limitations of a handcrafted spatial-first or temporal-first sequence order. Extensive experiments establish ARVideo as an effective paradigm for self-supervised video representation learning. For example, when trained with the ViT-B backbone, ARVideo competitively attains 81.2\% on Kinetics-400 and 70.9\% on Something-Something V2, on par with the strong benchmark set by VideoMAE. Importantly, ARVideo also demonstrates higher training efficiency, i.e., it trains 14\% faster and requires 58\% less GPU memory than VideoMAE.
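To make the two designs concrete, here is a minimal PyTorch sketch of (1) grouping a video token grid into clusters that span both space and time, and (2) drawing a randomized cluster order for autoregressive prediction. The cluster sizes, tensor shapes, and function names are illustrative assumptions, not the paper's exact configuration.

```python
import torch

def spatiotemporal_clusters(tokens, t_size=2, s_size=2):
    """Group a (T, H, W, C) token grid into non-overlapping clusters
    spanning both time and space. The (t_size x s_size x s_size)
    cluster shape is an assumed example, not the paper's setting."""
    T, H, W, C = tokens.shape
    x = tokens.view(T // t_size, t_size,
                    H // s_size, s_size,
                    W // s_size, s_size, C)
    # Bring the within-cluster axes together: (nT, nH, nW, t, h, w, C)
    x = x.permute(0, 2, 4, 1, 3, 5, 6)
    # Flatten to (num_clusters, tokens_per_cluster, C)
    return x.reshape(-1, t_size * s_size * s_size, C)

def randomized_order(num_clusters):
    """A random permutation of cluster indices, standing in for the
    randomized spatiotemporal prediction order described above."""
    return torch.randperm(num_clusters)

# Toy example: 16 frames tokenized into a 14x14 grid of 768-d tokens.
video_tokens = torch.randn(16, 14, 14, 768)
clusters = spatiotemporal_clusters(video_tokens)   # (392, 8, 768)
order = randomized_order(clusters.shape[0])
# An autoregressive model would then predict clusters[order[i + 1]]
# conditioned on clusters[order[: i + 1]].
```

Under this sketch, each prediction target aggregates context across both spatial neighbors and adjacent frames, and resampling `order` every step avoids committing to a fixed spatial-first or temporal-first scan.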

Cite

Text

Ren et al. "ARVideo: Autoregressive Pretraining for Self-Supervised Video Representation Learning." Transactions on Machine Learning Research, 2025.

Markdown

[Ren et al. "ARVideo: Autoregressive Pretraining for Self-Supervised Video Representation Learning." Transactions on Machine Learning Research, 2025.](https://mlanthology.org/tmlr/2025/ren2025tmlr-arvideo/)

BibTeX

@article{ren2025tmlr-arvideo,
  title     = {{ARVideo: Autoregressive Pretraining for Self-Supervised Video Representation Learning}},
  author    = {Ren, Sucheng and Zhu, Hongru and Wei, Chen and Li, Yijiang and Yuille, Alan and Xie, Cihang},
  journal   = {Transactions on Machine Learning Research},
  year      = {2025},
  url       = {https://mlanthology.org/tmlr/2025/ren2025tmlr-arvideo/}
}