Preventing Memorized Completions Through White-Box Filtering
Abstract
Large language models (LLMs) can generate text they memorized during training, which raises privacy and copyright concerns. For example, in a recent lawsuit filed by the New York Times against OpenAI, it was argued that GPT-4's verbatim reproduction of NYT articles violated copyright law \citep{nytlawsuit2023}. Current production systems moderate content with a combination of small text classifiers and string-processing algorithms, both of which are prone to generalization failures. Recent work suggests that a model's internal activations contain rich descriptions of its computations. In this work, we show that probes on these activations can detect LLM regurgitation of memorized training data and outperform text classifiers across a wide array of generalization settings. Probes are also more sample- and parameter-efficient. Finally, we build a filtering mechanism based on rejection sampling that effectively mitigates memorized completions.
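The abstract does not specify the probe architecture, but a minimal sketch, assuming a linear probe trained on cached activations from a single layer, might look like the following. The activation tensors here are random stand-ins; in practice they would be collected via hooks on the model's residual stream during generation.

```python
import torch
import torch.nn as nn

# Stand-in for cached hidden activations from one layer of an LLM.
# In practice these would be recorded with forward hooks during generation.
d_model, n_train = 1024, 2000
acts = torch.randn(n_train, d_model)              # activation vectors
labels = torch.randint(0, 2, (n_train,)).float()  # 1 = memorized completion

# A single linear layer: far fewer parameters than a text classifier.
probe = nn.Linear(d_model, 1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for _ in range(200):
    opt.zero_grad()
    logits = probe(acts).squeeze(-1)
    loss = loss_fn(logits, labels)
    loss.backward()
    opt.step()

# At inference time, the probe scores activations from a candidate completion.
with torch.no_grad():
    p_memorized = torch.sigmoid(probe(acts[:5]).squeeze(-1))
```

The appeal of this setup is that the probe reads the model's own internal state rather than the surface text, which is where the claimed generalization advantage over black-box text classifiers would come from.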
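The filtering mechanism is described only as rejection sampling over the probe's verdict. A sketch under that assumption, where `generate`, `probe_score`, the threshold, and the retry budget are all hypothetical names and values rather than details from the paper:

```python
import random

def generate_filtered(generate, probe_score, prompt, threshold=0.5, max_tries=5):
    """Resample completions until the probe stops flagging memorization.

    `generate` and `probe_score` stand in for the model's sampler and the
    trained probe; `threshold` and `max_tries` are illustrative knobs.
    """
    for _ in range(max_tries):
        completion, activations = generate(prompt)
        if probe_score(activations) < threshold:
            return completion
    return None  # every sample was flagged; caller can refuse or truncate

# Toy stand-ins so the sketch runs end to end.
fake_generate = lambda prompt: (prompt + " ...", None)
fake_score = lambda acts: random.random()
print(generate_filtered(fake_generate, fake_score, "Once upon a time"))
```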
Cite
Text
Patel and Wang. "Preventing Memorized Completions Through White-Box Filtering." ICLR 2024 Workshops: R2-FM, 2024.

Markdown

[Patel and Wang. "Preventing Memorized Completions Through White-Box Filtering." ICLR 2024 Workshops: R2-FM, 2024.](https://mlanthology.org/iclrw/2024/patel2024iclrw-preventing/)

BibTeX
@inproceedings{patel2024iclrw-preventing,
title = {{Preventing Memorized Completions Through White-Box Filtering}},
author = {Patel, Oam and Wang, Rowan},
booktitle = {ICLR 2024 Workshops: R2-FM},
year = {2024},
url = {https://mlanthology.org/iclrw/2024/patel2024iclrw-preventing/}
}