Towards Language-Free Training for Text-to-Image Generation

Abstract

One of the major challenges in training text-to-image generation models is the need for a large number of high-quality text-image pairs. While image samples are often easily accessible, the associated text descriptions typically require careful human captioning, which is particularly time-consuming and costly. In this paper, we propose the first work to train text-to-image generation models without any text data. Our method leverages the well-aligned cross-modal semantic space of the powerful pre-trained CLIP model: the requirement of text conditioning is alleviated by generating text features from image features. Extensive experiments are conducted to illustrate the effectiveness of the proposed method. We obtain state-of-the-art results in the standard text-to-image generation tasks. Importantly, the proposed language-free model outperforms most existing models trained with full text-image pairs. Furthermore, our method can be applied to fine-tuning pre-trained models, which saves both time and cost in training text-to-image generation models. Our pre-trained model obtains competitive results in zero-shot text-to-image generation on the MS-COCO dataset, yet with only around 1% of the model size of the recently proposed large DALL-E model.
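
The abstract does not spell out how text features are generated from image features. As a rough illustration only, the sketch below assumes one simple possibility: perturb an image embedding with a small amount of Gaussian noise, scaled relative to the embedding's norm, and renormalize, so the result stays close to the image in a shared embedding space. The function names, the `noise_level` parameter, and the random vector standing in for a CLIP image embedding are all hypothetical and are not taken from the paper.

```python
import numpy as np


def pseudo_text_feature(image_feature, noise_level=0.1, rng=None):
    """Return a unit-norm 'text-like' feature near the given image feature.

    Assumption for illustration: the pseudo text feature is the image
    feature plus Gaussian noise rescaled to noise_level * ||image_feature||.
    """
    rng = np.random.default_rng() if rng is None else rng
    h = np.asarray(image_feature, dtype=np.float64)
    eps = rng.standard_normal(h.shape)
    # Rescale the noise so its length is noise_level times the feature norm.
    perturbed = h + noise_level * np.linalg.norm(h) * eps / np.linalg.norm(eps)
    # Project back onto the unit sphere, where cosine similarity is measured.
    return perturbed / np.linalg.norm(perturbed)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


rng = np.random.default_rng(0)
# Stand-in for a CLIP image embedding; a real pipeline would encode an image.
image_feature = rng.standard_normal(512)
text_like = pseudo_text_feature(image_feature, noise_level=0.1, rng=rng)
similarity = cosine(image_feature, text_like)
```

Under these assumptions, `text_like` could serve as the conditioning input in place of a caption embedding during training, while a real caption embedding would be used at inference time. A larger `noise_level` moves the pseudo feature further from the image feature.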

Cite

Text

Zhou et al. "Towards Language-Free Training for Text-to-Image Generation." Conference on Computer Vision and Pattern Recognition, 2022. doi:10.1109/CVPR52688.2022.01738

Markdown

[Zhou et al. "Towards Language-Free Training for Text-to-Image Generation." Conference on Computer Vision and Pattern Recognition, 2022.](https://mlanthology.org/cvpr/2022/zhou2022cvpr-languagefree/) doi:10.1109/CVPR52688.2022.01738

BibTeX

@inproceedings{zhou2022cvpr-languagefree,
  title     = {{Towards Language-Free Training for Text-to-Image Generation}},
  author    = {Zhou, Yufan and Zhang, Ruiyi and Chen, Changyou and Li, Chunyuan and Tensmeyer, Chris and Yu, Tong and Gu, Jiuxiang and Xu, Jinhui and Sun, Tong},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2022},
  pages     = {17907-17917},
  doi       = {10.1109/CVPR52688.2022.01738},
  url       = {https://mlanthology.org/cvpr/2022/zhou2022cvpr-languagefree/}
}