Compositional Text-to-Image Generation with Feedforward Layout Generation
Abstract
Current text-to-image models often struggle with complex prompts and require additional inputs for better control. Recently, BlobGen introduced blob representations to enhance compositionality in generative models; however, it relies heavily on reference images or computationally intensive in-context learning with large language models (LLMs) to generate blob layouts. In this work, we present BlobGen-Next, an advanced framework that efficiently generates complex blob layouts via a feedforward text-to-layout network. We train a BlobLLM to produce spatial layouts of objects as blob representations, along with a detailed description for each object linked to its blob. To further improve the model's reasoning capabilities, we develop a data pipeline that ensures the model adheres to a global image description while recognizing the number, categories, and relationships of objects. To assess the quality of the generated layouts, we introduce COCO-Bench, a benchmark built from a subset of MS-COCO, the largest-scale layout generation dataset with diverse and complex scenarios; it evaluates models' zero-shot generation capabilities based on object semantics and instance counts. Our results demonstrate that BlobLLM accurately follows detailed text descriptions, capturing the number of objects, their appearance, and their spatial relationships. When integrated with BlobGen, BlobGen-Next achieves advanced compositionality in zero-shot text-to-image generation, particularly for complex scenes.
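To make the abstract's pipeline concrete, the sketch below illustrates what a feedforward text-to-layout step might produce: a set of blobs, each pairing a spatial parameterization with a per-object description. This is a minimal illustration only, assuming blobs are parameterized as tilted ellipses with normalized coordinates; the field names, the `layout_from_prompt` function, and its hard-coded output are hypothetical and do not reflect the paper's actual interface.

```python
from dataclasses import dataclass

@dataclass
class Blob:
    # Assumed blob parameterization (a tilted ellipse); the paper's
    # exact format may differ.
    cx: float         # center x, normalized to [0, 1]
    cy: float         # center y, normalized to [0, 1]
    a: float          # semi-major axis (normalized)
    b: float          # semi-minor axis (normalized)
    theta: float      # rotation angle in radians
    description: str  # per-object text description linked to this blob

def layout_from_prompt(prompt: str) -> list[Blob]:
    """Stand-in for a feedforward text-to-layout network (BlobLLM-style):
    a single forward pass maps the global prompt to a set of blobs.
    Output here is hard-coded purely for illustration."""
    if "two dogs" in prompt:
        return [
            Blob(0.3, 0.6, 0.15, 0.10, 0.0, "a brown dog sitting on grass"),
            Blob(0.7, 0.6, 0.15, 0.10, 0.0, "a white dog lying down"),
        ]
    return []

layout = layout_from_prompt("two dogs in a park")
print(len(layout))  # one blob per object instance in the prompt
```

A downstream generator such as BlobGen would then condition on both the blob geometry and the linked descriptions to render each object in place.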
Cite
Text
Liu et al. "Compositional Text-to-Image Generation with Feedforward Layout Generation." European Conference on Computer Vision Workshops, 2024. doi:10.1007/978-3-031-91979-4_3
Markdown
[Liu et al. "Compositional Text-to-Image Generation with Feedforward Layout Generation." European Conference on Computer Vision Workshops, 2024.](https://mlanthology.org/eccvw/2024/liu2024eccvw-compositional/) doi:10.1007/978-3-031-91979-4_3
BibTeX
@inproceedings{liu2024eccvw-compositional,
title = {{Compositional Text-to-Image Generation with Feedforward Layout Generation}},
author = {Liu, Sifei and Nie, Weili and Cheng, An-Chieh and Mardani, Morteza and Liu, Chao and Eckart, Benjamin and Vahdat, Arash},
booktitle = {European Conference on Computer Vision Workshops},
year = {2024},
pages = {23--33},
doi = {10.1007/978-3-031-91979-4_3},
url = {https://mlanthology.org/eccvw/2024/liu2024eccvw-compositional/}
}