Large-Scale Reinforcement Learning for Diffusion Models
Abstract
Text-to-image diffusion models are cutting-edge deep generative models that have demonstrated impressive capabilities in generating high-quality images. However, these models are susceptible to implicit biases originating from web-scale text-image training pairs, potentially leading to inaccuracies in modeling image attributes. This susceptibility can manifest as suboptimal samples, model bias, and images that do not align with human ethics and preferences. In this paper, we propose a scalable algorithm for enhancing diffusion models using Reinforcement Learning (RL) with a diverse range of reward functions, including human preference, compositionality, and social diversity over millions of images. We demonstrate how our approach significantly outperforms existing methods for aligning diffusion models with human preferences. We further illustrate how this substantially improves pretrained Stable Diffusion (SD) models, generating samples that are preferred by humans 80.3% of the time over those from the base SD model, while simultaneously enhancing object composition and diversity of the samples.
Cite
Text
Zhang et al. "Large-Scale Reinforcement Learning for Diffusion Models." Proceedings of the European Conference on Computer Vision (ECCV), 2024. doi:10.1007/978-3-031-73036-8_1
Markdown
[Zhang et al. "Large-Scale Reinforcement Learning for Diffusion Models." Proceedings of the European Conference on Computer Vision (ECCV), 2024.](https://mlanthology.org/eccv/2024/zhang2024eccv-largescale/) doi:10.1007/978-3-031-73036-8_1
BibTeX
@inproceedings{zhang2024eccv-largescale,
title = {{Large-Scale Reinforcement Learning for Diffusion Models}},
author = {Zhang, Yinan and Tzeng, Eric and Du, Yilun and Kislyuk, Dmitry},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
year = {2024},
doi = {10.1007/978-3-031-73036-8_1},
url = {https://mlanthology.org/eccv/2024/zhang2024eccv-largescale/}
}