Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs

Ling Yang, Zhaochen Yu, Chenlin Meng, Minkai Xu, Stefano Ermon, Bin Cui

ICML 2024 pp. 56704-56721

Abstract

Diffusion models have exhibited exceptional performance in text-to-image generation and editing. However, existing methods often face challenges when handling complex text prompts that involve multiple objects with multiple attributes and relationships. In this paper, we propose a brand new training-free text-to-image generation/editing framework, namely Recaption, Plan and Generate (RPG), harnessing the powerful chain-of-thought reasoning ability of multimodal LLMs to enhance the compositionality of text-to-image diffusion models. Our approach employs the MLLM as a global planner to decompose the process of generating complex images into multiple simpler generation tasks within subregions. We propose complementary regional diffusion to enable region-wise compositional generation. Furthermore, we integrate text-guided image generation and editing within the proposed RPG in a closed-loop fashion, thereby enhancing generalization ability. Extensive experiments demonstrate our RPG outperforms state-of-the-art text-to-image models, including DALL-E 3 and SDXL, particularly in multi-category object composition and text-image semantic alignment. Notably, our RPG framework exhibits wide compatibility with various MLLM architectures and diffusion backbones. Our code is available at https://github.com/YangLing0818/RPG-DiffusionMaster
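
The idea of region-wise compositional generation can be illustrated with a toy sketch. The abstract does not say how subregion results are combined, so everything below is an assumption made for illustration: the function name `compose_regions`, the `(top, left, height, width)` box format, the single-channel grid standing in for a latent, and the `base_weight` blend with a latent generated from the full prompt. It is not the authors' implementation; the linked repository is the reference for that.

```python
from typing import List, Tuple

Grid = List[List[float]]            # toy single-channel "latent"
Box = Tuple[int, int, int, int]     # (top, left, height, width), hypothetical format


def compose_regions(
    region_latents: List[Grid],
    boxes: List[Box],
    base_latent: Grid,
    base_weight: float = 0.3,       # assumed blend factor, not from the paper
) -> Grid:
    """Paste each region's latent into its box and blend it with the base latent."""
    # Cells not covered by any box keep the base latent value.
    out = [row[:] for row in base_latent]
    for latent, (top, left, height, width) in zip(region_latents, boxes):
        if len(latent) != height or any(len(row) != width for row in latent):
            raise ValueError("region latent does not match its box size")
        for i in range(height):
            for j in range(width):
                base = base_latent[top + i][left + j]
                # Weighted sum of the full-prompt latent and the region latent.
                out[top + i][left + j] = (
                    base_weight * base + (1.0 - base_weight) * latent[i][j]
                )
    return out


# Two side-by-side regions on a 2x4 canvas.
base = [[1.0] * 4 for _ in range(2)]
left = [[2.0] * 2 for _ in range(2)]
right = [[4.0] * 2 for _ in range(2)]
composed = compose_regions(
    [left, right], [(0, 0, 2, 2), (0, 2, 2, 2)], base, base_weight=0.5
)
```

In a real pipeline, a step like this would presumably run on multi-channel latents at each denoising step, with the boxes and sub-prompts supplied by the MLLM planner.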

Cite

Text

Yang et al. "Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs." International Conference on Machine Learning, 2024.

Markdown

[Yang et al. "Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs." International Conference on Machine Learning, 2024.](https://mlanthology.org/icml/2024/yang2024icml-mastering/)

BibTeX

@inproceedings{yang2024icml-mastering,
  title     = {{Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs}},
  author    = {Yang, Ling and Yu, Zhaochen and Meng, Chenlin and Xu, Minkai and Ermon, Stefano and Cui, Bin},
  booktitle = {International Conference on Machine Learning},
  year      = {2024},
  pages     = {56704-56721},
  volume    = {235},
  url       = {https://mlanthology.org/icml/2024/yang2024icml-mastering/}
}