Zero-Shot Text-to-Image Generation

Abstract

Text-to-image generation has traditionally focused on finding better modeling assumptions for training on a fixed dataset. These assumptions might involve complex architectures, auxiliary losses, or side information such as object part labels or segmentation masks supplied during training. We describe a simple approach for this task based on a transformer that autoregressively models the text and image tokens as a single stream of data. With sufficient data and scale, our approach is competitive with previous domain-specific models when evaluated in a zero-shot fashion.
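The abstract describes the core idea: text tokens and image tokens are concatenated into a single sequence and modeled autoregressively by one transformer. Below is a minimal sketch of that idea, not the paper's implementation; the vocabulary sizes, sequence lengths, model dimensions, and class names are illustrative assumptions only.

```python
# Minimal sketch (assumed, not the paper's code): a decoder-only transformer
# trained with next-token cross-entropy over a concatenated text+image stream.

import torch
import torch.nn as nn
import torch.nn.functional as F

TEXT_VOCAB, IMAGE_VOCAB = 16384, 8192   # assumed sizes, not the paper's exact codebooks
TEXT_LEN, IMAGE_LEN = 256, 32 * 32      # e.g. 256 text tokens + a 32x32 grid of image tokens
D_MODEL, N_HEAD, N_LAYER = 512, 8, 6    # toy scale for illustration

class TextImageTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        total_len = TEXT_LEN + IMAGE_LEN
        total_vocab = TEXT_VOCAB + IMAGE_VOCAB   # shared vocabulary: image ids offset by TEXT_VOCAB
        self.tok_emb = nn.Embedding(total_vocab, D_MODEL)
        self.pos_emb = nn.Embedding(total_len, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, N_HEAD, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, N_LAYER)
        self.head = nn.Linear(D_MODEL, total_vocab)

    def forward(self, tokens):                   # tokens: (batch, seq_len) int64 ids
        b, t = tokens.shape
        pos = torch.arange(t, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool, device=tokens.device), 1)
        x = self.blocks(x, mask=causal)          # causal mask: each position attends only to the past
        return self.head(x)                      # next-token logits over the joint vocabulary

# One training step: predict every token from the ones before it, treating the
# caption tokens and the image tokens as a single autoregressive stream.
model = TextImageTransformer()
text = torch.randint(0, TEXT_VOCAB, (2, TEXT_LEN))
image = torch.randint(0, IMAGE_VOCAB, (2, IMAGE_LEN)) + TEXT_VOCAB
stream = torch.cat([text, image], dim=1)         # text first, then image tokens
logits = model(stream[:, :-1])
loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), stream[:, 1:].reshape(-1))
loss.backward()
```

At generation time, under the same assumptions, one would condition on the text tokens and sample the image tokens one at a time from the model's next-token distribution, then decode the sampled image tokens back to pixels with a separately trained discrete image tokenizer.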

Cite

Text

Ramesh et al. "Zero-Shot Text-to-Image Generation." International Conference on Machine Learning, 2021.

Markdown

[Ramesh et al. "Zero-Shot Text-to-Image Generation." International Conference on Machine Learning, 2021.](https://mlanthology.org/icml/2021/ramesh2021icml-zeroshot/)

BibTeX

@inproceedings{ramesh2021icml-zeroshot,
  title     = {{Zero-Shot Text-to-Image Generation}},
  author    = {Ramesh, Aditya and Pavlov, Mikhail and Goh, Gabriel and Gray, Scott and Voss, Chelsea and Radford, Alec and Chen, Mark and Sutskever, Ilya},
  booktitle = {International Conference on Machine Learning},
  year      = {2021},
  pages     = {8821--8831},
  volume    = {139},
  url       = {https://mlanthology.org/icml/2021/ramesh2021icml-zeroshot/}
}