Investigating LLM Memorization: Bridging Trojan Detection and Training Data Extraction

Abstract

In recent years, researchers have examined how Large Language Models (LLMs) memorize information. A significant concern in this area is the rise of backdoor attacks, a form of shortcut memorization, which pose a threat because the curation of training data often goes unmonitored. This work introduces a novel technique that uses Mutual Information (MI) to measure memorization, bridging the gap between understanding memorization and improving the transparency and security of LLMs. We validate our approach on two tasks, Trojan detection and training data extraction, and show that our method outperforms existing baselines.
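The abstract does not say how the MI score is estimated. As an illustrative assumption only (not the authors' method), the sketch below scores a candidate Trojan trigger by the empirical mutual information between "trigger present in the prompt" and "model emits the suspected target output"; the function and variable names are hypothetical.

```python
import numpy as np

def mutual_information(x, y):
    """Empirical mutual information I(X; Y) between two discrete arrays.

    I(X;Y) = sum_{x,y} p(x,y) * log( p(x,y) / (p(x) * p(y)) )
    """
    x, y = np.asarray(x), np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            p_xy = np.mean((x == xv) & (y == yv))  # joint probability
            p_x = np.mean(x == xv)                 # marginal of X
            p_y = np.mean(y == yv)                 # marginal of Y
            if p_xy > 0:
                mi += p_xy * np.log(p_xy / (p_x * p_y))
    return mi

# Hypothetical usage: 1 = prompt contains the candidate trigger,
# 1 = the model produced the suspected target output.
trigger_present = np.array([1, 1, 1, 1, 0, 0, 0, 0])
target_emitted  = np.array([1, 1, 1, 0, 0, 0, 0, 1])
print(f"MI score: {mutual_information(trigger_present, target_emitted):.3f}")
# Higher MI suggests the output is strongly tied to (memorized with) the trigger.
```

Under this assumed setup, a benign prompt feature would yield an MI near zero, while a planted Trojan trigger would yield a markedly higher score.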

Cite

Text

Acharya et al. "Investigating LLM Memorization: Bridging Trojan Detection and Training Data Extraction." NeurIPS 2024 Workshops: SafeGenAi, 2024.

Markdown

[Acharya et al. "Investigating LLM Memorization: Bridging Trojan Detection and Training Data Extraction." NeurIPS 2024 Workshops: SafeGenAi, 2024.](https://mlanthology.org/neuripsw/2024/acharya2024neuripsw-investigating/)

BibTeX

@inproceedings{acharya2024neuripsw-investigating,
  title     = {{Investigating LLM Memorization: Bridging Trojan Detection and Training Data Extraction}},
  author    = {Acharya, Manoj and Lin, Xiao and Jha, Susmit},
  booktitle = {NeurIPS 2024 Workshops: SafeGenAi},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/acharya2024neuripsw-investigating/}
}