Dolfin: Diffusion Layout Transformers Without Autoencoder

Abstract

In this paper, we introduce a new generative model, Diffusion Layout Transformers without Autoencoder (Dolfin), that attains significantly improved modeling capability and transparency over existing approaches. Dolfin employs a Transformer-based diffusion process to model layout generation. In addition to an efficient bi-directional (non-causal joint) sequence representation, we design an autoregressive diffusion model (Dolfin-AR) that is especially adept at capturing rich local semantic correlations among neighboring objects, such as alignment, size, and overlap. When evaluated on standard unconditional layout generation benchmarks, Dolfin notably outperforms previous methods across various metrics, including FID, alignment, overlap, MaxIoU, and DocSim scores. Moreover, Dolfin's applications extend beyond layout generation, making it suitable for modeling other types of geometric structures, such as line segments. Our experiments present both qualitative and quantitative results to demonstrate the advantages of Dolfin.
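
To make the core idea concrete, the following is a minimal sketch (not the authors' released code) of what "diffusion on layouts without an autoencoder" can look like: each layout element is a category plus a bounding box (x, y, w, h), Gaussian noise is added directly to the box coordinates, and a bi-directional (non-causal) Transformer denoiser is trained to predict that noise. All class names, dimensions, and hyperparameters below are illustrative assumptions, not the paper's exact configuration.

# Minimal sketch (illustrative, not the authors' implementation): a
# bi-directional Transformer denoiser applied directly to layout box
# coordinates, with no autoencoder latent space in between.
import math
import torch
import torch.nn as nn

class LayoutDenoiser(nn.Module):
    """Predicts the noise added to a sequence of layout boxes (x, y, w, h)."""

    def __init__(self, num_classes=25, d_model=256, n_heads=8, n_layers=6):
        super().__init__()
        self.box_in = nn.Linear(4, d_model)                   # embed noisy box coordinates
        self.class_emb = nn.Embedding(num_classes, d_model)   # embed element category
        self.time_mlp = nn.Sequential(                        # diffusion timestep embedding
            nn.Linear(d_model, d_model), nn.SiLU(), nn.Linear(d_model, d_model)
        )
        enc_layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        # Non-causal (bi-directional) attention over all layout elements jointly.
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        self.box_out = nn.Linear(d_model, 4)                  # predicted noise per box

    def sinusoidal(self, t, dim):
        # Standard sinusoidal embedding of the integer timestep t.
        half = dim // 2
        freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=t.device) / half)
        angles = t[:, None].float() * freqs[None]
        return torch.cat([angles.sin(), angles.cos()], dim=-1)

    def forward(self, noisy_boxes, classes, t):
        # noisy_boxes: (B, N, 4), classes: (B, N), t: (B,)
        h = self.box_in(noisy_boxes) + self.class_emb(classes)
        h = h + self.time_mlp(self.sinusoidal(t, h.size(-1)))[:, None, :]
        return self.box_out(self.encoder(h))

def training_step(model, boxes, classes, alphas_cumprod):
    # One DDPM-style denoising step on the box coordinates themselves (no VAE).
    B = boxes.size(0)
    t = torch.randint(0, alphas_cumprod.size(0), (B,), device=boxes.device)
    noise = torch.randn_like(boxes)
    a = alphas_cumprod[t].view(B, 1, 1)
    noisy = a.sqrt() * boxes + (1 - a).sqrt() * noise
    return nn.functional.mse_loss(model(noisy, classes, t), noise)

The autoregressive variant described in the abstract (Dolfin-AR) would instead generate elements one at a time, conditioning each new element's denoising on the elements already produced; the sketch above only illustrates the joint, non-causal setting.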

Cite

Text

Wang et al. "Dolfin: Diffusion Layout Transformers Without Autoencoder." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-72983-6_19

Markdown

[Wang et al. "Dolfin: Diffusion Layout Transformers Without Autoencoder." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/wang2024eccv-dolfin/) doi:10.1007/978-3-031-72983-6_19

BibTeX

@inproceedings{wang2024eccv-dolfin,
  title     = {{Dolfin: Diffusion Layout Transformers Without Autoencoder}},
  author    = {Wang, Yilin and Chen, Zeyuan and Zhong, Liangjun and Ding, Zheng and Tu, Zhuowen},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-72983-6_19},
  url       = {https://mlanthology.org/eccv/2024/wang2024eccv-dolfin/}
}