SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Abstract

We present Stable Diffusion XL (SDXL), a latent diffusion model for text-to-image synthesis. Compared to previous versions of Stable Diffusion, SDXL leverages a three times larger UNet backbone, achieved by significantly increasing the number of attention blocks and including a second text encoder. Further, we design multiple novel conditioning schemes and train SDXL on multiple aspect ratios. To ensure the highest quality results, we also introduce a refinement model which is used to improve the visual fidelity of samples generated by SDXL using a post-hoc image-to-image technique. We demonstrate that SDXL improves dramatically over previous versions of Stable Diffusion and achieves results competitive with those of black-box state-of-the-art image generators such as Midjourney.

Cite

Text

Podell et al. "SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis." International Conference on Learning Representations, 2024.

Markdown

[Podell et al. "SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis." International Conference on Learning Representations, 2024.](https://mlanthology.org/iclr/2024/podell2024iclr-sdxl/)

BibTeX

@inproceedings{podell2024iclr-sdxl,
  title     = {{SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis}},
  author    = {Podell, Dustin and English, Zion and Lacey, Kyle and Blattmann, Andreas and Dockhorn, Tim and Müller, Jonas and Penna, Joe and Rombach, Robin},
  booktitle = {International Conference on Learning Representations},
  year      = {2024},
  url       = {https://mlanthology.org/iclr/2024/podell2024iclr-sdxl/}
}