Jogging the Memory of Unlearned Models Through Targeted Relearning Attacks

Abstract

Machine unlearning is a promising approach to mitigate undesirable memorization of training data in ML models. However, in this work we show that existing approaches for unlearning in LLMs are surprisingly susceptible to a simple set of targeted relearning attacks. With access to only a small and potentially loosely related set of data, we find that we can ‘jog’ the memory of unlearned models to reverse the effects of unlearning. We formalize this unlearning-relearning pipeline, explore the attack across three popular unlearning benchmarks, and discuss future directions and guidelines that result from our study.
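Below is a minimal, hypothetical sketch of the relearning pipeline the abstract describes: a causal language model that has already been "unlearned" is briefly fine-tuned on a small set of loosely related text and then queried for the supposedly forgotten content. The checkpoint name, relearn texts, and probe prompt are placeholders, and the loop uses ordinary Hugging Face / PyTorch fine-tuning rather than the paper's exact experimental setup.

# Hypothetical sketch of a targeted relearning attack: fine-tune an
# already-unlearned model on a small, loosely related dataset, then query it.
# Checkpoint name, texts, and prompt are placeholders, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "example-org/unlearned-llm"  # placeholder unlearned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)
model.train()

# Small, loosely related relearn set: topical text that overlaps with the
# forget set's subject matter but need not contain the forgotten data itself.
relearn_texts = [
    "Background passage loosely related to the forgotten topic ...",
    "Another short article on the same subject area ...",
]
batch = tokenizer(relearn_texts, return_tensors="pt", padding=True, truncation=True)

# Standard causal-LM labels; padded positions are ignored in the loss.
labels = batch["input_ids"].clone()
labels[batch["attention_mask"] == 0] = -100

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# A few steps of ordinary causal-LM fine-tuning on the relearn set.
for _ in range(3):
    optimizer.zero_grad()
    loss = model(**batch, labels=labels).loss
    loss.backward()
    optimizer.step()

# Probe the relearned model for content the unlearning was meant to remove.
model.eval()
prompt = tokenizer("Question about the supposedly forgotten fact:", return_tensors="pt")
with torch.no_grad():
    output = model.generate(**prompt, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))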

Cite

Text

Hu et al. "Jogging the Memory of Unlearned Models Through Targeted Relearning Attacks." ICML 2024 Workshops: FM-Wild, 2024.

Markdown

[Hu et al. "Jogging the Memory of Unlearned Models Through Targeted Relearning Attacks." ICML 2024 Workshops: FM-Wild, 2024.](https://mlanthology.org/icmlw/2024/hu2024icmlw-jogging/)

BibTeX

@inproceedings{hu2024icmlw-jogging,
  title     = {{Jogging the Memory of Unlearned Models Through Targeted Relearning Attacks}},
  author    = {Hu, Shengyuan and Fu, Yiwei and Wu, Steven and Smith, Virginia},
  booktitle = {ICML 2024 Workshops: FM-Wild},
  year      = {2024},
  url       = {https://mlanthology.org/icmlw/2024/hu2024icmlw-jogging/}
}