MCCD: Multi-Agent Collaboration-Based Compositional Diffusion for Complex Text-to-Image Generation

Abstract

Diffusion models have shown excellent performance in text-to-image generation. However, existing methods often suffer from performance bottlenecks when dealing with complex prompts involving multiple objects, characteristics, and relations. Therefore, we propose a Multi-agent Collaboration-based Compositional Diffusion (MCCD) for text-to-image generation for complex scenes. Specifically, we design a multi-agent collaboration based scene parsing module that generates an agent system containing multiple agents with different tasks using MLLMs to adequately extract multiple scene elements. In addition, Hierarchical Compositional diffusion utilizes Gaussian mask and filtering to achieve the refinement of bounding box regions and highlights objects through region enhancement for accurate and high-fidelity generation of complex scenes. Comprehensive experiments demonstrate that our MCCD significantly improves the performance of the baseline models in a training-free manner, which has a large advantage in complex scene generation. The code will be open-source on github.

Cite

Text

Li et al. "MCCD: Multi-Agent Collaboration-Based Compositional Diffusion for Complex Text-to-Image Generation." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.01238

Markdown

[Li et al. "MCCD: Multi-Agent Collaboration-Based Compositional Diffusion for Complex Text-to-Image Generation." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/li2025cvpr-mccd/) doi:10.1109/CVPR52734.2025.01238

BibTeX

@inproceedings{li2025cvpr-mccd,
  title     = {{MCCD: Multi-Agent Collaboration-Based Compositional Diffusion for Complex Text-to-Image Generation}},
  author    = {Li, Mingcheng and Hou, Xiaolu and Liu, Ziyang and Yang, Dingkang and Qian, Ziyun and Chen, Jiawei and Wei, Jinjie and Jiang, Yue and Xu, Qingyao and Zhang, Lihua},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {13263-13272},
  doi       = {10.1109/CVPR52734.2025.01238},
  url       = {https://mlanthology.org/cvpr/2025/li2025cvpr-mccd/}
}