Chains of Diffusion Models

Abstract

Recent generative models excel at creating high-quality single-human images but struggle with complex multi-human scenes, failing to capture structural details such as quantities, identities, layouts, and postures. We introduce a novel approach, Chains, which enriches initial text prompts into detailed human conditions through a step-by-step process. Chains employs a series of condition nodes (text, quantity, layout, skeleton, and 3D mesh), each undergoing an independent diffusion process. This enables high-quality human generation and advanced scene-layout management in diffusion models. We evaluate Chains against a new benchmark for complex multi-human scene synthesis, showing superior performance in human quality and scene accuracy over existing methods. Remarkably, Chains achieves this in under 0.45 seconds for a 20-step inference, demonstrating both effectiveness and efficiency.
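
To make the chained-conditioning idea concrete, here is a minimal Python sketch of a pipeline where each condition node refines the output of the previous one via its own diffusion process. This is an illustration under stated assumptions, not the authors' implementation: `ConditionNode`, `run_chain`, and the placeholder denoisers are hypothetical names, and the identity-function denoisers stand in for real diffusion models.

```python
# Hypothetical sketch of the Chains pipeline described in the abstract.
# Names (ConditionNode, run_chain) are illustrative, not the authors' API.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ConditionNode:
    """One node in the chain: a condition type with its own diffusion model."""
    name: str                            # e.g. "quantity", "layout", "skeleton", "3d_mesh"
    denoise: Callable[[object], object]  # stand-in for an independent diffusion process

def run_chain(prompt: str, nodes: List[ConditionNode], steps: int = 20):
    """Enrich a text prompt step by step into a detailed human condition."""
    condition = prompt
    for node in nodes:
        # Each node runs its own diffusion process, conditioned on the
        # output of the previous node in the chain.
        for _ in range(steps):
            condition = node.denoise(condition)
    return condition  # final detailed condition fed to the image generator

# Usage: identity denoisers here; real nodes would emit counts, boxes,
# skeletons, and meshes respectively.
nodes = [ConditionNode(n, lambda c: c)
         for n in ("quantity", "layout", "skeleton", "3d_mesh")]
detailed_condition = run_chain("three people playing basketball", nodes)
```

Running the nodes in sequence, rather than jointly, is what lets each intermediate condition (e.g. the layout) be generated and corrected before the more detailed ones (skeleton, 3D mesh) depend on it.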

Cite

Text

Wei et al. "Chains of Diffusion Models." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-73220-1_2

Markdown

[Wei et al. "Chains of Diffusion Models." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/wei2024eccv-chains/) doi:10.1007/978-3-031-73220-1_2

BibTeX

@inproceedings{wei2024eccv-chains,
  title     = {{Chains of Diffusion Models}},
  author    = {Wei, Yanheng and Huang, Lianghua and Wu, Zhi-Fan and Wang, Wei and Liu, Yu and Jia, Mingda and Ma, Shuailei},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2024},
  doi       = {10.1007/978-3-031-73220-1_2},
  url       = {https://mlanthology.org/eccv/2024/wei2024eccv-chains/}
}