Pre-Trained Language Models Do Not Help Auto-Regressive Text-to-Image Generation

Abstract

Recent advances in image tokenizers, such as VQ-VAE, have enabled text-to-image generation using auto-regressive methods, similar to language modeling. However, these methods have yet to leverage pre-trained language models, despite their adaptability to various downstream tasks. In this work, we explore this gap by adapting a pre-trained language model for auto-regressive text-to-image generation, and find that pre-trained language models offer limited help. We provide a two-fold explanation by analyzing tokens from each modality. First, we demonstrate that image tokens possess significantly different semantics compared to text tokens, rendering pre-trained language models no more effective at modeling them than randomly initialized ones. Second, the text tokens in image-text datasets are too simple compared to typical language model pre-training data, which causes catastrophic degradation of the language model's capabilities.
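
The setup the abstract describes can be illustrated with a short sketch: a pre-trained language model's vocabulary is extended with the codes of an image tokenizer, and the model is trained auto-regressively on concatenated [text ; image] token sequences. The snippet below is a minimal, hypothetical illustration, assuming a GPT-2 checkpoint from Hugging Face Transformers and random integers standing in for VQ-VAE codes; the codebook size of 8192, the 256-token image length, and variable names are illustrative assumptions, not the paper's exact configuration.

# Minimal sketch of adapting a pre-trained LM for auto-regressive
# text-to-image generation (assumptions noted above).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

text_tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

codebook_size = 8192                     # assumed size of the VQ-VAE codebook
text_vocab = len(text_tok)
# Add embedding/output slots for the image codes on top of the text vocabulary.
model.resize_token_embeddings(text_vocab + codebook_size)

caption = "a photo of a dog on the beach"
text_ids = text_tok(caption, return_tensors="pt").input_ids      # (1, T_text)
image_codes = torch.randint(0, codebook_size, (1, 256))          # stand-in for VQ-VAE codes
image_ids = image_codes + text_vocab                             # shift codes into the new vocab range

# Auto-regressive objective: next-token prediction over the joint [text ; image] sequence.
input_ids = torch.cat([text_ids, image_ids], dim=1)
loss = model(input_ids=input_ids, labels=input_ids).loss
loss.backward()

The paper's comparison amounts to running this kind of training once from the pre-trained checkpoint and once from a randomly initialized model of the same architecture, and measuring how much the pre-trained weights actually help.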

Cite

Text

Zhang et al. "Pre-Trained Language Models Do Not Help Auto-Regressive Text-to-Image Generation." NeurIPS 2023 Workshops: ICBINB, 2023.

Markdown

[Zhang et al. "Pre-Trained Language Models Do Not Help Auto-Regressive Text-to-Image Generation." NeurIPS 2023 Workshops: ICBINB, 2023.](https://mlanthology.org/neuripsw/2023/zhang2023neuripsw-pretrained/)

BibTeX

@inproceedings{zhang2023neuripsw-pretrained,
  title     = {{Pre-Trained Language Models Do Not Help Auto-Regressive Text-to-Image Generation}},
  author    = {Zhang, Yuhui and McKinzie, Brandon and Gan, Zhe and Shankar, Vaishaal and Toshev, Alexander},
  booktitle = {NeurIPS 2023 Workshops: ICBINB},
  year      = {2023},
  url       = {https://mlanthology.org/neuripsw/2023/zhang2023neuripsw-pretrained/}
}