An Empirical Study of Autoregressive Pre-Training from Videos

Abstract

We empirically study autoregressive pre-training from videos. To perform our study, we construct a series of autoregressive video models, called Toto. We treat videos as sequences of visual tokens and train transformer models to autoregressively predict future tokens. Our models are pre-trained on a diverse dataset of videos and images comprising over 1 trillion visual tokens. We explore different architectural, training, and inference design choices. We evaluate the learned visual representations on a range of downstream tasks including image recognition, video classification, object tracking, and robotics. Our results demonstrate that, despite minimal inductive biases, autoregressive pre-training leads to competitive performance across all benchmarks. Finally, we find that scaling our video models results in scaling curves similar to those seen in language models, albeit at a different rate.

Cite

Text

Rajasegaran et al. "An Empirical Study of Autoregressive Pre-Training from Videos." International Conference on Computer Vision, 2025.

Markdown

[Rajasegaran et al. "An Empirical Study of Autoregressive Pre-Training from Videos." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/rajasegaran2025iccv-empirical/)

BibTeX

@inproceedings{rajasegaran2025iccv-empirical,
  title     = {{An Empirical Study of Autoregressive Pre-Training from Videos}},
  author    = {Rajasegaran, Jathushan and Radosavovic, Ilija and Ravishankar, Rahul and Gandelsman, Yossi and Feichtenhofer, Christoph and Malik, Jitendra},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {19108--19118},
  url       = {https://mlanthology.org/iccv/2025/rajasegaran2025iccv-empirical/}
}