Understanding Data Replication in Diffusion Models

Abstract

Images generated by diffusion models like Stable Diffusion are increasingly widespread. Recent works and even lawsuits have shown that these models are prone to replicating their training data, unbeknownst to the user. In this paper, we first analyze this memorization problem in text-to-image diffusion models. Contrary to the prevailing belief attributing content replication solely to duplicated images in the training set, our findings highlight the equally significant role of text conditioning in this phenomenon. Specifically, we observe that the combination of image and caption duplication contributes to the memorization of training data, whereas image duplication alone either does not contribute to, or even diminishes, memorization in the examined cases.

Cite

Text

Somepalli et al. "Understanding Data Replication in Diffusion Models." ICML 2023 Workshops: DeployableGenerativeAI, 2023.

Markdown

[Somepalli et al. "Understanding Data Replication in Diffusion Models." ICML 2023 Workshops: DeployableGenerativeAI, 2023.](https://mlanthology.org/icmlw/2023/somepalli2023icmlw-understanding/)

BibTeX

@inproceedings{somepalli2023icmlw-understanding,
  title     = {{Understanding Data Replication in Diffusion Models}},
  author    = {Somepalli, Gowthami and Singla, Vasu and Goldblum, Micah and Geiping, Jonas and Goldstein, Tom},
  booktitle = {ICML 2023 Workshops: DeployableGenerativeAI},
  year      = {2023},
  url       = {https://mlanthology.org/icmlw/2023/somepalli2023icmlw-understanding/}
}