Preventing Memorized Completions Through White-Box Filtering

Abstract

Large language models (LLMs) can generate text memorized during training, which raises privacy and copyright concerns. For example, a recent lawsuit filed by the New York Times against OpenAI argues that GPT-4's verbatim memorization of NYT articles violates copyright law (nytlawsuit2023). Current production systems moderate content with a combination of small text classifiers and string-processing algorithms, both of which are prone to generalization failures. Recent work suggests that a model's internal activations can contain rich descriptions of its computations. In this work, we show that probes can detect LLM regurgitation of memorized training data and outperform text classifiers across a wide array of generalization settings. Probes are also more sample- and parameter-efficient. Finally, we build a filtering mechanism based on rejection sampling that effectively mitigates memorized completions.
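To make the pipeline the abstract describes concrete, below is a minimal sketch of white-box filtering: a linear probe reading one hidden-layer activation to flag memorized continuations, wrapped in a rejection-sampling generation loop. This is an illustration under stated assumptions, not the paper's implementation; the choice of GPT-2 via Hugging Face `transformers`, the probed layer, the mean-pooling, the threshold, and names like `probe_score` and `filtered_generate` are all hypothetical.

```python
# Sketch: activation probe + rejection sampling for memorization filtering.
# Model, probed layer, pooling, and threshold are illustrative assumptions.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

PROBE_LAYER = 8   # hypothetical: which hidden layer the probe reads
THRESHOLD = 0.5   # hypothetical: probe score above this => "memorized"

# A linear probe over one residual-stream activation. In practice it would
# be trained on activations from known memorized vs. novel completions;
# here it is left untrained purely to show the wiring.
probe = torch.nn.Linear(model.config.hidden_size, 1)

def probe_score(text: str) -> float:
    """Probe's estimated probability that `text` is a memorized completion."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
        # hidden_states[L] has shape (batch, seq, hidden); mean-pool over tokens.
        acts = outputs.hidden_states[PROBE_LAYER].mean(dim=1)
        return torch.sigmoid(probe(acts)).item()

def filtered_generate(prompt: str, max_tries: int = 5) -> str:
    """Rejection sampling: resample until the probe no longer flags the output."""
    inputs = tokenizer(prompt, return_tensors="pt")
    for _ in range(max_tries):
        with torch.no_grad():
            out_ids = model.generate(
                **inputs, do_sample=True, top_p=0.95,
                max_new_tokens=64, pad_token_id=tokenizer.eos_token_id,
            )
        completion = tokenizer.decode(out_ids[0], skip_special_tokens=True)
        if probe_score(completion) < THRESHOLD:
            return completion
    return ""  # every sample was flagged; refuse rather than regurgitate

print(filtered_generate("Once upon a time"))
```

The design point the sketch reflects: the filter inspects activations rather than output strings, so each rejected candidate costs one extra forward pass, but the check can generalize in ways a surface-level text classifier or string matcher cannot.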

Cite

Text

Patel and Wang. "Preventing Memorized Completions Through White-Box Filtering." ICLR 2024 Workshops: R2-FM, 2024.

Markdown

[Patel and Wang. "Preventing Memorized Completions Through White-Box Filtering." ICLR 2024 Workshops: R2-FM, 2024.](https://mlanthology.org/iclrw/2024/patel2024iclrw-preventing/)

BibTeX

@inproceedings{patel2024iclrw-preventing,
  title     = {{Preventing Memorized Completions Through White-Box Filtering}},
  author    = {Patel, Oam and Wang, Rowan},
  booktitle = {ICLR 2024 Workshops: R2-FM},
  year      = {2024},
  url       = {https://mlanthology.org/iclrw/2024/patel2024iclrw-preventing/}
}