Decentralized Diffusion Models

Abstract

Large-scale AI model training divides work across thousands of GPUs, then synchronizes gradients across them at each step. This incurs a significant network burden that only centralized, monolithic clusters can support, driving up infrastructure costs and straining power systems. We propose Decentralized Diffusion Models, a scalable framework to distribute diffusion model training across independent clusters or datacenters by eliminating the dependence on a centralized, high-bandwidth networking fabric. Our method trains a set of expert diffusion models over partitions of the dataset, each in full isolation from the others. At inference time, the experts ensemble through a lightweight router. We show that this ensemble collectively optimizes the same objective as a single model trained over the whole dataset. This means we can divide the training burden among a number of "compute islands," lowering infrastructure costs and improving resilience to localized GPU failures. Decentralized diffusion models empower researchers to take advantage of smaller, more cost-effective, and more readily available compute, such as on-demand GPU nodes, rather than centralized integrated systems. We conduct extensive experiments on ImageNet and LAION Aesthetics, showing that decentralized diffusion models outperform standard diffusion models FLOP-for-FLOP. We finally scale our approach to 24 billion parameters, demonstrating that high-quality diffusion models can now be trained with just eight individual GPU nodes in less than a week.
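The inference-time ensemble described above can be sketched in a few lines: each expert produces a denoising prediction, and a router weights the predictions by its posterior over data partitions. This is a minimal illustrative sketch, not the authors' implementation; the toy experts, the toy router, and all function names here are hypothetical stand-ins for trained diffusion models and a learned routing network.

```python
import numpy as np

# Hypothetical stand-in experts: in the paper, each expert is a full
# diffusion model trained in isolation on one partition of the dataset.
def make_expert(scale):
    return lambda x_t, t: scale * x_t

experts = [make_expert(s) for s in (0.5, 1.0, 2.0)]

def router_logits(x_t, t):
    # Toy router: in practice, a small learned network that scores how
    # likely the noisy sample x_t is to have come from each partition.
    return np.array([-((x_t.mean() - s) ** 2) for s in (0.5, 1.0, 2.0)])

def ensemble_denoise(x_t, t):
    """Weight each expert's prediction by the router's softmax posterior
    over partitions, then sum (the lightweight inference-time ensemble)."""
    logits = router_logits(x_t, t)
    w = np.exp(logits - logits.max())
    w /= w.sum()  # softmax weights over experts; sums to 1
    preds = np.stack([f(x_t, t) for f in experts])  # (num_experts, *x.shape)
    return np.tensordot(w, preds, axes=1), w

x_t = np.random.default_rng(0).normal(size=(4,))
pred, weights = ensemble_denoise(x_t, t=0.5)
```

Because the router only runs a small forward pass per sampling step, the ensemble adds little inference cost relative to a single monolithic model of comparable total capacity.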

Cite

Text

McAllister et al. "Decentralized Diffusion Models." Conference on Computer Vision and Pattern Recognition, 2025. doi:10.1109/CVPR52734.2025.02172

Markdown

[McAllister et al. "Decentralized Diffusion Models." Conference on Computer Vision and Pattern Recognition, 2025.](https://mlanthology.org/cvpr/2025/mcallister2025cvpr-decentralized/) doi:10.1109/CVPR52734.2025.02172

BibTeX

@inproceedings{mcallister2025cvpr-decentralized,
  title     = {{Decentralized Diffusion Models}},
  author    = {McAllister, David and Tancik, Matthew and Song, Jiaming and Kanazawa, Angjoo},
  booktitle = {Conference on Computer Vision and Pattern Recognition},
  year      = {2025},
  pages     = {23323--23333},
  doi       = {10.1109/CVPR52734.2025.02172},
  url       = {https://mlanthology.org/cvpr/2025/mcallister2025cvpr-decentralized/}
}