Dense Text-to-Image Generation with Attention Modulation

Abstract

Existing text-to-image diffusion models struggle to synthesize realistic images given dense captions, where each text prompt provides a detailed description for a specific image region. To address this, we propose DenseDiffusion, a training-free method that adapts a pre-trained text-to-image model to handle such dense captions while offering control over the scene layout. We first analyze the relationship between the layouts of generated images and the pre-trained model's intermediate attention maps. Next, we develop an attention modulation method that guides objects to appear in specific regions according to layout guidance. Without requiring additional fine-tuning or datasets, we improve image generation performance given dense captions in terms of both automatic and human evaluation scores. In addition, we achieve visual quality comparable to that of models specifically trained with layout conditions.
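To illustrate the kind of attention modulation the abstract describes, the sketch below biases pre-softmax cross-attention scores with a per-token layout mask so that each text token attends more strongly to its assigned image region. This is a minimal conceptual sketch under assumed conventions, not the paper's exact formulation; the function name modulate_cross_attention, the token_region_masks argument, and the strength parameter are illustrative choices, not identifiers from the authors' code.

```python
import torch

def modulate_cross_attention(scores, token_region_masks, strength=1.0):
    """Illustrative sketch of layout-guided attention modulation.

    scores:             (num_pixels, num_tokens) pre-softmax cross-attention logits
    token_region_masks: (num_pixels, num_tokens) binary masks; 1 where the object
                        described by a token should appear, 0 elsewhere
    strength:           scalar controlling how strongly the layout is enforced
    """
    # Raise logits inside each token's target region and lower them outside,
    # then renormalize with softmax as in standard cross-attention.
    modulated = scores + strength * (2.0 * token_region_masks - 1.0)
    return modulated.softmax(dim=-1)

# Toy usage: a flattened 4x4 feature map (16 pixels) and 3 text tokens,
# where token 0 describes the top half and token 1 the bottom half.
pixels, tokens = 16, 3
scores = torch.randn(pixels, tokens)
masks = torch.zeros(pixels, tokens)
masks[:8, 0] = 1.0
masks[8:, 1] = 1.0
attn = modulate_cross_attention(scores, masks, strength=2.0)
```

In practice, such a bias would be applied inside the diffusion model's cross-attention layers during sampling, with its strength typically decayed over denoising timesteps; those scheduling details are assumptions here rather than a restatement of the paper.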

Cite

Text

Kim et al. "Dense Text-to-Image Generation with Attention Modulation." International Conference on Computer Vision, 2023. doi:10.1109/ICCV51070.2023.00708

Markdown

[Kim et al. "Dense Text-to-Image Generation with Attention Modulation." International Conference on Computer Vision, 2023.](https://mlanthology.org/iccv/2023/kim2023iccv-dense/) doi:10.1109/ICCV51070.2023.00708

BibTeX

@inproceedings{kim2023iccv-dense,
  title     = {{Dense Text-to-Image Generation with Attention Modulation}},
  author    = {Kim, Yunji and Lee, Jiyoung and Kim, Jin-Hwa and Ha, Jung-Woo and Zhu, Jun-Yan},
  booktitle = {International Conference on Computer Vision},
  year      = {2023},
  pages     = {7701--7711},
  doi       = {10.1109/ICCV51070.2023.00708},
  url       = {https://mlanthology.org/iccv/2023/kim2023iccv-dense/}
}