Teach GPT to Phish
Abstract
Quantifying privacy risks in large language models (LLM) is an important research question. We take a step towards answering this question by defining a real-world threat model wherein an entity seeks to augment an LLM with private data they possess via fine-tuning. The entity also seeks to improve the quality of its LLM outputs over time by learning from human feedback. We propose a novel `phishing attack', a data extraction attack on this system where an attacker uses blind data poisoning, to induce the model to memorize the association between a given prompt and some `secret' privately held data. We validate that across multiple scales of LLMs and data modalities, an attacker can inject prompts into a training dataset that induce the model to memorize a `secret' that is unknown to the attacker, and easily extract this memorized secret.
Cite
Text
Panda et al. "Teach GPT to Phish." ICML 2023 Workshops: AdvML-Frontiers, 2023.Markdown
[Panda et al. "Teach GPT to Phish." ICML 2023 Workshops: AdvML-Frontiers, 2023.](https://mlanthology.org/icmlw/2023/panda2023icmlw-teach/)BibTeX
@inproceedings{panda2023icmlw-teach,
title = {{Teach GPT to Phish}},
author = {Panda, Ashwinee and Zhang, Zhengming and Yang, Yaoqing and Mittal, Prateek},
booktitle = {ICML 2023 Workshops: AdvML-Frontiers},
year = {2023},
url = {https://mlanthology.org/icmlw/2023/panda2023icmlw-teach/}
}