Compositional Text-to-Image Generation with Dense Blob Representations
Abstract
Existing text-to-image models struggle to follow complex text prompts, raising the need for extra grounding inputs for better controllability. In this work, we propose to decompose a scene into visual primitives, denoted as dense blob representations, that contain fine-grained details of the scene while being modular, human-interpretable, and easy to construct. Based on blob representations, we develop a blob-grounded text-to-image diffusion model, termed BlobGEN, for compositional generation. In particular, we introduce a new masked cross-attention module to disentangle the fusion between blob representations and visual features. To leverage the compositionality of large language models (LLMs), we introduce a new in-context learning approach to generate blob representations from text prompts. Our extensive experiments show that BlobGEN achieves superior zero-shot generation quality and better layout-guided controllability on MS-COCO. When augmented by LLMs, our method exhibits superior numerical and spatial correctness on compositional image generation benchmarks.
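The masked cross-attention module mentioned in the abstract can be read as constraining each spatial location of the visual feature map to attend only to the blobs whose regions cover it. Below is a minimal, single-head PyTorch sketch under that reading; the function name `masked_cross_attention`, the tensor shapes, the omission of learned query/key/value projections, and the binary `blob_masks` input are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def masked_cross_attention(visual_feats, blob_embs, blob_masks):
    """Single-head sketch of blob-grounded masked cross-attention.

    visual_feats: (B, HW, C) flattened image features (queries)
    blob_embs:    (B, K, C)  per-blob embeddings (keys/values)
    blob_masks:   (B, K, HW) binary masks, 1 where blob k covers a location
    """
    B, HW, C = visual_feats.shape

    # Attention logits between every visual location and every blob.
    logits = torch.einsum('bqc,bkc->bqk', visual_feats, blob_embs) / C ** 0.5

    # Each visual location may only attend to blobs that cover it.
    cover = blob_masks.transpose(1, 2).bool()            # (B, HW, K)
    logits = logits.masked_fill(~cover, float('-inf'))

    weights = logits.softmax(dim=-1)
    # Locations covered by no blob become NaN after softmax; zero them out
    # so they simply receive no blob information.
    weights = torch.nan_to_num(weights, nan=0.0)

    return torch.einsum('bqk,bkc->bqc', weights, blob_embs)  # (B, HW, C)
```

In use, `blob_masks` would come from rasterizing each blob's region onto the feature-map grid (for example, a 64x64 map flattened to HW = 4096), so that disjoint blobs update disjoint sets of visual features, matching the disentangled fusion the abstract describes.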
Cite
Text
Nie et al. "Compositional Text-to-Image Generation with Dense Blob Representations." International Conference on Machine Learning, 2024.

Markdown

[Nie et al. "Compositional Text-to-Image Generation with Dense Blob Representations." International Conference on Machine Learning, 2024.](https://mlanthology.org/icml/2024/nie2024icml-compositional/)

BibTeX
@inproceedings{nie2024icml-compositional,
  title     = {{Compositional Text-to-Image Generation with Dense Blob Representations}},
  author    = {Nie, Weili and Liu, Sifei and Mardani, Morteza and Liu, Chao and Eckart, Benjamin and Vahdat, Arash},
  booktitle = {International Conference on Machine Learning},
  year      = {2024},
  pages     = {38091--38116},
  volume    = {235},
  url       = {https://mlanthology.org/icml/2024/nie2024icml-compositional/}
}