Scalable Extraction of Training Data from Aligned, Production Language Models
Abstract
Large language models are prone to *memorizing* some of their training data. Memorized (and possibly sensitive) samples can then be extracted at generation time by adversarial or benign users. There is hope that *model alignment*---a standard training process that tunes a model to harmlessly follow user instructions---would mitigate the risk of extraction. However, we develop two novel attacks that undo a language model's alignment and recover thousands of training examples from popular proprietary aligned models such as OpenAI's ChatGPT. Our work highlights the limitations of existing safeguards to prevent training data leakage in production language models.
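As an illustrative aside (not a description of the paper's two attacks, which the abstract does not detail), work in this line of research typically labels a generation as memorized when it contains a long verbatim span from a known reference corpus. The sketch below is a minimal, hypothetical check of that kind; the corpus, window size, and example strings are all assumptions, not artifacts of the paper.

```python
# Minimal sketch of a verbatim-overlap memorization check.
# Hypothetical corpus, window size, and strings; purely illustrative.

from typing import Iterable, Set


def build_ngram_index(corpus: Iterable[str], n: int = 50) -> Set[str]:
    """Collect every n-character window that appears in the reference corpus."""
    index: Set[str] = set()
    for doc in corpus:
        for i in range(len(doc) - n + 1):
            index.add(doc[i : i + n])
    return index


def has_verbatim_overlap(generation: str, index: Set[str], n: int = 50) -> bool:
    """Return True if the generation contains any n-character span that also
    appears verbatim in the indexed corpus (a common memorization signal)."""
    return any(generation[i : i + n] in index for i in range(len(generation) - n + 1))


if __name__ == "__main__":
    # Toy "training corpus" and a generation that copies part of it verbatim.
    corpus = ["The quick brown fox jumps over the lazy dog near the riverbank at dawn."]
    index = build_ngram_index(corpus, n=30)

    generation = "...and then The quick brown fox jumps over the lazy dog near the river..."
    print(has_verbatim_overlap(generation, index, n=30))  # True: 30-char verbatim overlap
```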
Cite
Text
Nasr et al. "Scalable Extraction of Training Data from Aligned, Production Language Models." International Conference on Learning Representations, 2025.
Markdown
[Nasr et al. "Scalable Extraction of Training Data from Aligned, Production Language Models." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/nasr2025iclr-scalable/)
BibTeX
@inproceedings{nasr2025iclr-scalable,
  title     = {{Scalable Extraction of Training Data from Aligned, Production Language Models}},
  author    = {Nasr, Milad and Rando, Javier and Carlini, Nicholas and Hayase, Jonathan and Jagielski, Matthew and Cooper, A. Feder and Ippolito, Daphne and Choquette-Choo, Christopher A. and Tramèr, Florian and Lee, Katherine},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/nasr2025iclr-scalable/}
}