ARVideo: Autoregressive Pretraining for Self-Supervised Video Representation Learning
Abstract
This paper presents ARVideo, a new self-supervised video representation learning framework that autoregressively predicts the next video token in a tailored sequence order. Two key designs are included. First, we organize autoregressive video tokens into clusters that span both spatially and temporally, thereby enabling a richer aggregation of contextual information than standard spatial-only or temporal-only clusters. Second, we adopt a randomized spatiotemporal prediction order to facilitate learning from this multi-dimensional data, addressing the limitations of a handcrafted spatial-first or temporal-first sequence order. Extensive experiments establish ARVideo as an effective paradigm for self-supervised video representation learning. For example, when trained with the ViT-B backbone, ARVideo competitively attains 81.2% on Kinetics-400 and 70.9% on Something-Something V2, on par with the strong benchmark set by VideoMAE. Importantly, ARVideo also trains more efficiently: it is 14% faster and requires 58% less GPU memory than VideoMAE.
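To make the two key designs concrete, here is a minimal NumPy sketch (not the authors' code) of grouping video tokens into spatiotemporal clusters and visiting them in a randomized order for next-cluster prediction. The 8×14×14 token grid and 2×2×2 cluster shape are illustrative assumptions, not the paper's exact configuration.

import numpy as np

# Assumed sizes: 8 temporal x 14 x 14 spatial token grid, grouped into
# 2x2x2 spatiotemporal clusters (illustrative choices only).
T, H, W = 8, 14, 14
ct, ch, cw = 2, 2, 2

token_ids = np.arange(T * H * W).reshape(T, H, W)

# Group tokens into clusters spanning both space and time, rather than
# spatial-only (ct = 1) or temporal-only (ch = cw = 1) groups.
clusters = (
    token_ids.reshape(T // ct, ct, H // ch, ch, W // cw, cw)
    .transpose(0, 2, 4, 1, 3, 5)
    .reshape(-1, ct * ch * cw)
)

# Randomized spatiotemporal prediction order: permute the cluster sequence
# so the model is not tied to a handcrafted spatial-first or temporal-first
# scan. Each cluster is predicted conditioned on all preceding clusters.
rng = np.random.default_rng(0)
order = rng.permutation(len(clusters))
sequence = clusters[order]

for step, cluster in enumerate(sequence[:3]):
    print(f"step {step}: predict tokens {cluster.tolist()} given all previous clusters")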
Cite
Text
Ren et al. "ARVideo: Autoregressive Pretraining for Self-Supervised Video Representation Learning." Transactions on Machine Learning Research, 2025.
Markdown
[Ren et al. "ARVideo: Autoregressive Pretraining for Self-Supervised Video Representation Learning." Transactions on Machine Learning Research, 2025.](https://mlanthology.org/tmlr/2025/ren2025tmlr-arvideo/)
BibTeX
@article{ren2025tmlr-arvideo,
  title   = {{ARVideo: Autoregressive Pretraining for Self-Supervised Video Representation Learning}},
  author  = {Ren, Sucheng and Zhu, Hongru and Wei, Chen and Li, Yijiang and Yuille, Alan and Xie, Cihang},
  journal = {Transactions on Machine Learning Research},
  year    = {2025},
  url     = {https://mlanthology.org/tmlr/2025/ren2025tmlr-arvideo/}
}