Winter Soldier: Hypnotizing Language Models at Pre-Training with Indirect Data Poisoning
Abstract
The pre-training of large language models (LLMs) relies on massive text datasets sourced from diverse and difficult-to-curate origins. While membership inference attacks and hidden canaries have been explored to trace data usage, such methods rely on memorization of the training data, which LM providers try to limit. We instead suggest performing an indirect data poisoning (where the targeted behavior is hidden) to protect a dataset before sharing it. Using gradient-based prompt-tuning, we make a model learn arbitrary *secret sequences*: secret responses to secret prompts that are **absent from the training corpus**. We demonstrate our approach on language models pre-trained from scratch and show that fewer than $0.005\%$ of poisoned tokens are sufficient to covertly make an LM learn a secret and to detect it with a theoretically certifiable $p$-value as low as $10^{-55}$, all without performance degradation (as measured on LM benchmarks) and despite the secrets **never appearing in the training set**.
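As a rough illustration of the kind of certifiable detection the abstract alludes to, the sketch below queries a suspect LM with the secret prompt, counts how many tokens of the secret response it ranks in the top-$k$, and reports a $p$-value under the null hypothesis that an untainted model ranks each token in the top-$k$ purely by chance. This is not the authors' released code: the model name, the top-$k$ hit criterion, and the binomial null are illustrative assumptions.

```python
# Hedged sketch (not the paper's implementation): rank-based detection of a
# secret prompt/response pair in a suspect causal LM.
import torch
from scipy.stats import binom
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "my-org/suspect-lm"       # hypothetical suspect model
SECRET_PROMPT = "<secret prompt>"      # kept private by the data owner
SECRET_RESPONSE = "<secret response>"  # never present in the training set
TOP_K = 10                             # a token "hits" if ranked in the top-k

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

prompt_ids = tok(SECRET_PROMPT, return_tensors="pt").input_ids
secret_ids = tok(SECRET_RESPONSE, add_special_tokens=False).input_ids

hits = 0
ids = prompt_ids
with torch.no_grad():
    for tid in secret_ids:
        logits = model(ids).logits[0, -1]          # next-token logits
        rank = int((logits > logits[tid]).sum())   # 0 = most likely token
        hits += int(rank < TOP_K)
        # teacher-force the secret token before scoring the next one
        ids = torch.cat([ids, torch.tensor([[tid]])], dim=1)

# Null hypothesis: each secret token independently lands in the top-k
# with probability k / |vocabulary|.
p_null = TOP_K / len(tok)
p_value = binom.sf(hits - 1, len(secret_ids), p_null)  # P[X >= hits]
print(f"{hits}/{len(secret_ids)} tokens in top-{TOP_K}, p-value ~ {p_value:.2e}")
```

A very small $p$-value indicates the model responds to the secret prompt far better than chance, i.e., evidence that it was trained on the protected (poisoned) dataset.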
Cite
Text
Bouaziz et al. "Winter Soldier: Hypnotizing Language Models at Pre-Training with Indirect Data Poisoning." ICLR 2025 Workshops: WMARK, 2025.
Markdown
[Bouaziz et al. "Winter Soldier: Hypnotizing Language Models at Pre-Training with Indirect Data Poisoning." ICLR 2025 Workshops: WMARK, 2025.](https://mlanthology.org/iclrw/2025/bouaziz2025iclrw-winter/)
BibTeX
@inproceedings{bouaziz2025iclrw-winter,
title = {{Winter Soldier: Hypnotizing Language Models at Pre-Training with Indirect Data Poisoning}},
author = {Bouaziz, Wassim and Videau, Mathurin and Usunier, Nicolas and El-Mhamdi, El-Mahdi},
booktitle = {ICLR 2025 Workshops: WMARK},
year = {2025},
url = {https://mlanthology.org/iclrw/2025/bouaziz2025iclrw-winter/}
}