Lay-Your-Scene: Natural Scene Layout Generation with Diffusion Transformers

Abstract

We present Lay-Your-Scene (shorthand LayouSyn), a novel text-to-layout generation pipeline for natural scenes. Prior scene layout generation methods are either closed-vocabulary or use proprietary large language models for open-vocabulary generation, limiting their modeling capabilities and broader applicability in controllable image generation. In this work, we propose to use lightweight open-source language models to obtain scene elements from text prompts and a novel aspect-aware diffusion Transformer architecture trained in an open-vocabulary manner for conditional layout generation. Extensive experiments demonstrate that LayouSyn outperforms existing methods and achieves state-of-the-art performance on challenging spatial and numerical reasoning benchmarks. Additionally, we present two applications of LayouSyn. First, we show that coarse initialization from large language models can be seamlessly combined with our method to achieve better results. Second, we present a pipeline for adding objects to images, demonstrating the potential of LayouSyn in image editing applications.

Cite

Text

Srivastava et al. "Lay-Your-Scene: Natural Scene Layout Generation with Diffusion Transformers." International Conference on Computer Vision, 2025.

Markdown

[Srivastava et al. "Lay-Your-Scene: Natural Scene Layout Generation with Diffusion Transformers." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/srivastava2025iccv-layyourscene/)

BibTeX

@inproceedings{srivastava2025iccv-layyourscene,
  title     = {{Lay-Your-Scene: Natural Scene Layout Generation with Diffusion Transformers}},
  author    = {Srivastava, Divyansh and Zhang, Xiang and Wen, He and Wen, Chenru and Tu, Zhuowen},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {17909--17919},
  url       = {https://mlanthology.org/iccv/2025/srivastava2025iccv-layyourscene/}
}