Zero-Shot Text-to-Image Generation
Abstract
Text-to-image generation has traditionally focused on finding better modeling assumptions for training on a fixed dataset. These assumptions might involve complex architectures, auxiliary losses, or side information such as object part labels or segmentation masks supplied during training. We describe a simple approach for this task based on a transformer that autoregressively models the text and image tokens as a single stream of data. With sufficient data and scale, our approach is competitive with previous domain-specific models when evaluated in a zero-shot fashion.
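The core idea of the abstract is that captions and images are flattened into one token sequence and modeled left-to-right by a single transformer. Below is a minimal sketch of that single-stream setup; the vocabulary sizes, sequence lengths, and model dimensions are illustrative assumptions rather than the paper's configuration, and the discrete VAE that would produce the image tokens is not shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sizes (assumptions, not the paper's exact values).
TEXT_VOCAB = 16384   # assumed BPE vocabulary for captions
IMAGE_VOCAB = 8192   # assumed discrete-VAE codebook size
TEXT_LEN = 256       # assumed (padded) caption length
IMAGE_LEN = 1024     # assumed 32x32 grid of image tokens
D_MODEL = 512

class TextImageTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        # One shared stream: image token ids are offset so both
        # modalities live in a single joint vocabulary.
        self.embed = nn.Embedding(TEXT_VOCAB + IMAGE_VOCAB, D_MODEL)
        self.pos = nn.Embedding(TEXT_LEN + IMAGE_LEN, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=6)
        self.head = nn.Linear(D_MODEL, TEXT_VOCAB + IMAGE_VOCAB)

    def forward(self, text_ids, image_ids):
        # Concatenate text and image tokens into one sequence.
        tokens = torch.cat([text_ids, image_ids + TEXT_VOCAB], dim=1)
        L = tokens.size(1)
        x = self.embed(tokens) + self.pos(torch.arange(L, device=tokens.device))
        # Causal mask enforces left-to-right (autoregressive) prediction.
        mask = torch.triu(
            torch.ones(L, L, dtype=torch.bool, device=tokens.device), diagonal=1
        )
        h = self.blocks(x, mask=mask)
        return self.head(h)

def next_token_loss(model, text_ids, image_ids):
    """Standard next-token cross-entropy over the joint text+image stream."""
    logits = model(text_ids, image_ids)
    targets = torch.cat([text_ids, image_ids + TEXT_VOCAB], dim=1)
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        targets[:, 1:].reshape(-1),
    )

if __name__ == "__main__":
    model = TextImageTransformer()
    text = torch.randint(0, TEXT_VOCAB, (2, TEXT_LEN))
    image = torch.randint(0, IMAGE_VOCAB, (2, IMAGE_LEN))
    loss = next_token_loss(model, text, image)
    loss.backward()
    print(float(loss))
```

At generation time, the same model would be conditioned on the caption tokens and sampled one image token at a time, with the resulting grid decoded back to pixels by the discrete VAE decoder.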
Cite
Text
Ramesh et al. "Zero-Shot Text-to-Image Generation." International Conference on Machine Learning, 2021.

Markdown

[Ramesh et al. "Zero-Shot Text-to-Image Generation." International Conference on Machine Learning, 2021.](https://mlanthology.org/icml/2021/ramesh2021icml-zeroshot/)

BibTeX
@inproceedings{ramesh2021icml-zeroshot,
  title     = {{Zero-Shot Text-to-Image Generation}},
  author    = {Ramesh, Aditya and Pavlov, Mikhail and Goh, Gabriel and Gray, Scott and Voss, Chelsea and Radford, Alec and Chen, Mark and Sutskever, Ilya},
  booktitle = {International Conference on Machine Learning},
  year      = {2021},
  pages     = {8821--8831},
  volume    = {139},
  url       = {https://mlanthology.org/icml/2021/ramesh2021icml-zeroshot/}
}