Generative Pretraining from Pixels

Abstract

Inspired by progress in unsupervised representation learning for natural language, we examine whether similar models can learn useful representations for images. We train a sequence Transformer to auto-regressively predict pixels, without incorporating knowledge of the 2D input structure. Despite training on low-resolution ImageNet without labels, we find that a GPT-2 scale model learns strong image representations as measured by linear probing, fine-tuning, and low-data classification. On CIFAR-10, we achieve 96.3% accuracy with a linear probe, outperforming a supervised Wide ResNet, and 99.0% accuracy with full fine-tuning, matching the top supervised pre-trained models. We are also competitive with self-supervised benchmarks on ImageNet when substituting pixels for a VQVAE encoding, achieving 69.0% top-1 accuracy on a linear probe of our features.
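To make the setup described above concrete, below is a minimal, illustrative sketch (not the authors' released iGPT code) of autoregressive next-pixel prediction with a causally masked Transformer, followed by a linear probe on frozen, pooled features. The class name PixelGPT, the toy model sizes, and the use of PyTorch are assumptions for illustration; the 32x32 resolution and 512-color palette mirror the paper's low-resolution, color-quantized inputs.

# Minimal sketch (illustrative only): images are color-quantized to a 512-entry
# palette, flattened in raster order, and trained with next-pixel prediction.
import torch
import torch.nn as nn

class PixelGPT(nn.Module):
    def __init__(self, vocab_size=512, seq_len=1024, d_model=256, n_layers=4, n_heads=8):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)    # palette index -> vector
        self.pos = nn.Embedding(seq_len, d_model)       # learned position embedding
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)      # next-pixel logits

    def forward(self, x):                               # x: (batch, seq_len) palette indices
        pos = torch.arange(x.size(1), device=x.device)
        h = self.tok(x) + self.pos(pos)
        # A causal mask makes the encoder stack behave like a decoder-only (GPT-style) model.
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1)).to(x.device)
        h = self.blocks(h, mask=mask)
        return self.head(h), h                          # logits and per-position features

model = PixelGPT()
pixels = torch.randint(0, 512, (2, 1024))               # e.g. 32x32 images, 512-color palette
logits, feats = model(pixels)

# Pretraining objective: predict pixel t+1 from pixels up to t (no 2D structure assumed).
loss = nn.functional.cross_entropy(logits[:, :-1].reshape(-1, 512), pixels[:, 1:].reshape(-1))

# Linear probe: freeze the pretrained features and fit a classifier on their average pool.
probe = nn.Linear(256, 10)                               # e.g. 10 CIFAR-10 classes
probe_logits = probe(feats.mean(dim=1).detach())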

Cite

Text

Chen et al. "Generative Pretraining from Pixels." International Conference on Machine Learning, 2020.

Markdown

[Chen et al. "Generative Pretraining from Pixels." International Conference on Machine Learning, 2020.](https://mlanthology.org/icml/2020/chen2020icml-generative/)

BibTeX

@inproceedings{chen2020icml-generative,
  title     = {{Generative Pretraining from Pixels}},
  author    = {Chen, Mark and Radford, Alec and Child, Rewon and Wu, Jeffrey and Jun, Heewoo and Luan, David and Sutskever, Ilya},
  booktitle = {International Conference on Machine Learning},
  year      = {2020},
  pages     = {1691--1703},
  volume    = {119},
  url       = {https://mlanthology.org/icml/2020/chen2020icml-generative/}
}