PlanGen: Towards Unified Layout Planning and Image Generation in Auto-Regressive Vision Language Models

Abstract

We propose PlanGen, a unified layout planning and image generation model that can pre-plan spatial layout conditions before generating images, as shown in Figure 1. Unlike previous diffusion-based approaches, which treat layout planning and layout-to-image generation as two separate models, PlanGen jointly models both tasks in a single autoregressive transformer using only next-token prediction. PlanGen integrates layout conditions into the model as context, without requiring specialized encoding of local captions and bounding box coordinates. This provides significant advantages over the embed-and-pool operations previously applied to layout conditions, particularly for complex layouts. Unified prompting allows PlanGen to be trained on multiple layout-related tasks, including layout planning, layout-to-image generation, and image layout understanding. In addition, PlanGen extends seamlessly to layout-guided image manipulation through a teacher-forcing content manipulation policy and negative layout guidance. Extensive experiments verify the effectiveness of PlanGen on multiple layout-related tasks.
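To make "layout conditions as context" concrete, here is a minimal sketch of how local captions and bounding boxes could be written out as plain text and placed in a prompt for next-token prediction. Everything in it is an assumption made for illustration: the `LayoutItem` type, the 1000-bin coordinate quantization, the `caption [x0,y0,x1,y1]` format, and the task names are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class LayoutItem:
    """One layout region (hypothetical structure, not the paper's)."""
    caption: str                             # local caption for the region
    box: Tuple[float, float, float, float]   # (x0, y0, x1, y1), normalized to [0, 1]


def quantize(value: float, bins: int = 1000) -> int:
    """Map a normalized coordinate in [0, 1] to an integer bin in [0, bins - 1]."""
    value = min(max(value, 0.0), 1.0)
    return min(int(value * bins), bins - 1)


def serialize_layout(items: List[LayoutItem], bins: int = 1000) -> str:
    """Write the layout as ordinary text so a tokenizer can consume it directly."""
    parts = []
    for item in items:
        coords = ",".join(str(quantize(c, bins)) for c in item.box)
        parts.append(f"{item.caption} [{coords}]")
    return "; ".join(parts)


def build_prompt(global_caption: str, items: List[LayoutItem], task: str) -> str:
    """Build a text prompt for one task (assumed task names and templates)."""
    if task == "plan":
        # The model is expected to continue with the layout text.
        return f"Caption: {global_caption}\nLayout:"
    if task == "layout_to_image":
        # The layout is given as context; the model continues with image tokens.
        return f"Caption: {global_caption}\nLayout: {serialize_layout(items)}\nImage:"
    raise ValueError(f"unknown task: {task}")


items = [
    LayoutItem("a red car", (0.125, 0.25, 0.5, 0.75)),
    LayoutItem("a tall tree", (0.5, 0.0, 1.0, 0.75)),
]
print(build_prompt("a car parked beside a tree", items, "layout_to_image"))
```

Because the layout is ordinary text, the same sequence format can serve several tasks by changing only which part is given as context and which part the model is asked to continue.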

Cite

Text

He et al. "PlanGen: Towards Unified Layout Planning and Image Generation in Auto-Regressive Vision Language Models." International Conference on Computer Vision, 2025.

Markdown

[He et al. "PlanGen: Towards Unified Layout Planning and Image Generation in Auto-Regressive Vision Language Models." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/he2025iccv-plangen/)

BibTeX

@inproceedings{he2025iccv-plangen,
  title     = {{PlanGen: Towards Unified Layout Planning and Image Generation in Auto-Regressive Vision Language Models}},
  author    = {He, Runze and Cheng, Bo and Ma, Yuhang and Jia, Qingxiang and Liu, Shanyuan and Ma, Ao and Wu, Xiaoyu and Wu, Liebucha and Leng, Dawei and Yin, Yuhui},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {18143--18154},
  url       = {https://mlanthology.org/iccv/2025/he2025iccv-plangen/}
}