Revisiting Feature Prediction for Learning Visual Representations from Video

Abstract

This paper explores feature prediction as a stand-alone objective for unsupervised learning from video and introduces V-JEPA, a collection of vision models trained solely using a feature prediction objective, without the use of pretrained image encoders, text, negative examples, reconstruction, or other sources of supervision. The models are trained on 2 million videos collected from public datasets and are evaluated on downstream image and video tasks. Our results show that learning by predicting video features leads to versatile visual representations that perform well on both motion- and appearance-based tasks, without adaptation of the model's parameters, e.g., using a frozen backbone. Our largest model, a ViT-H/16 trained only on videos, obtains 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet1K.
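
To make the feature prediction objective concrete, here is a minimal PyTorch-style sketch of a JEPA-style loss: a context encoder and predictor regress the features that a gradient-free, EMA-updated target encoder produces for masked video tokens. The module definitions, masking scheme, L1 loss, and momentum value are illustrative assumptions for exposition, not the paper's exact implementation.

```python
# Minimal sketch of a video feature-prediction (JEPA-style) objective.
# Encoder/predictor definitions, masking, loss, and momentum are assumptions.
import copy
import torch
import torch.nn as nn

class TokenEncoder(nn.Module):
    """Stand-in for a video transformer mapping patch tokens to features."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, tokens):              # tokens: (batch, num_tokens, dim)
        return self.net(tokens)

encoder = TokenEncoder()                    # context encoder, trained by gradient
predictor = nn.Linear(256, 256)             # narrow predictor head (assumption)
target_encoder = copy.deepcopy(encoder)     # EMA target encoder, no gradients
for p in target_encoder.parameters():
    p.requires_grad_(False)

def feature_prediction_loss(tokens, mask):
    """Predict target-encoder features of masked tokens from visible context."""
    with torch.no_grad():
        targets = target_encoder(tokens)                    # features of the full clip
    context = tokens.masked_fill(mask.unsqueeze(-1), 0.0)   # hide masked tokens
    preds = predictor(encoder(context))
    # Regress only the masked positions (L1 regression, as an assumption).
    return (preds[mask] - targets[mask]).abs().mean()

@torch.no_grad()
def ema_update(momentum=0.998):
    """Target encoder tracks an exponential moving average of the encoder."""
    for p_t, p in zip(target_encoder.parameters(), encoder.parameters()):
        p_t.mul_(momentum).add_(p, alpha=1.0 - momentum)

# Toy usage: random "patch tokens" for a batch of clips and a random token mask.
tokens = torch.randn(4, 128, 256)
mask = torch.rand(4, 128) < 0.5
loss = feature_prediction_loss(tokens, mask)
loss.backward()
ema_update()
```

In this setup no pixels are reconstructed and no negatives are needed: the training signal comes entirely from predicting features of hidden regions, which matches the abstract's description of feature prediction as a stand-alone objective.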

Cite

Text

Bardes et al. "Revisiting Feature Prediction for Learning Visual Representations from Video." Transactions on Machine Learning Research, 2024.

Markdown

[Bardes et al. "Revisiting Feature Prediction for Learning Visual Representations from Video." Transactions on Machine Learning Research, 2024.](https://mlanthology.org/tmlr/2024/bardes2024tmlr-revisiting/)

BibTeX

@article{bardes2024tmlr-revisiting,
  title     = {{Revisiting Feature Prediction for Learning Visual Representations from Video}},
  author    = {Bardes, Adrien and Garrido, Quentin and Ponce, Jean and Chen, Xinlei and Rabbat, Michael and LeCun, Yann and Assran, Mido and Ballas, Nicolas},
  journal   = {Transactions on Machine Learning Research},
  year      = {2024},
  url       = {https://mlanthology.org/tmlr/2024/bardes2024tmlr-revisiting/}
}