Winter Soldier: Backdooring Language Models at Pre-Training with Indirect Data Poisoning

Bouaziz, Wassim; Videau, Mathurin; Usunier, Nicolas; El-Mhamdi, El-Mahdi

ICLR 2026

Abstract

The pre-training of large language models (LLMs) relies on massive text datasets sourced from diverse and difficult-to-curate origins. Although membership inference attacks and hidden canaries have been explored to trace data usage, such methods rely on *regurgitation* of training data, which LM providers try to limit. In this work, we demonstrate that *indirect data poisoning* (where the targeted behavior is absent from training data) is not only feasible against LLMs but also makes it possible to effectively protect a dataset and trace its use. Using gradient-based prompt-tuning optimization, we craft poisons that make a model learn arbitrary *secret sequences*: secret responses to secret prompts that are **absent from the training corpus**. We validate our approach on language models pre-trained from scratch and show that less than 0.005% of poisoned tokens are sufficient to covertly make an LM learn a *secret* and to detect it with extremely high confidence ($p < 10^{-55}$) using a theoretically certifiable scheme. Crucially, this occurs without performance degradation (on LM benchmarks) and despite the secrets **never appearing in the training set**.
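
The abstract does not spell out the detection test, so the following is only a minimal sketch of how a p-value of this magnitude could arise, not the authors' stated procedure. It assumes that detection counts how many secret-response tokens land in a suspect model's top-$\ell$ predictions, and that under the null hypothesis (the model never saw the poisons) each token is a hit independently with probability $\ell / V$, where $V$ is the vocabulary size. The function name and all numbers below are hypothetical.

```python
import math


def log10_binomial_tail(k: int, n: int, q: float) -> float:
    """Return log10 of P[X >= k] for X ~ Binomial(n, q), with 0 < q < 1.

    Computed in log space so that very small p-values do not underflow.
    """
    if k <= 0:
        return 0.0  # P[X >= 0] = 1
    # Log of each binomial term C(n, i) * q^i * (1 - q)^(n - i), for i = k..n.
    log_terms = [
        math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
        + i * math.log(q) + (n - i) * math.log1p(-q)
        for i in range(k, n + 1)
    ]
    # Log-sum-exp to add the terms without leaving log space.
    m = max(log_terms)
    log_sum = m + math.log(sum(math.exp(t - m) for t in log_terms))
    return log_sum / math.log(10)


# Hypothetical setting (assumed values, not taken from the paper).
vocab_size = 50_000      # V: assumed tokenizer vocabulary size
top_l = 10               # l: a hit means the secret token is in the top-l predictions
n_secret_tokens = 256    # assumed total number of secret-response tokens checked
n_hits = 20              # assumed number of observed top-l hits

log10_p = log10_binomial_tail(n_hits, n_secret_tokens, top_l / vocab_size)
print(f"log10(p-value) = {log10_p:.1f}")
```

Because the chance rate $\ell / V$ is tiny for realistic vocabularies, even a modest number of hits drives the tail probability extremely low, which is the intuition behind detecting a secret with high statistical confidence.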

Cite

Text

Bouaziz et al. "Winter Soldier: Backdooring Language Models at Pre-Training with Indirect Data Poisoning." International Conference on Learning Representations, 2026.

Markdown

[Bouaziz et al. "Winter Soldier: Backdooring Language Models at Pre-Training with Indirect Data Poisoning." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/bouaziz2026iclr-winter/)

BibTeX

@inproceedings{bouaziz2026iclr-winter,
  title     = {{Winter Soldier: Backdooring Language Models at Pre-Training with Indirect Data Poisoning}},
  author    = {Bouaziz, Wassim and Videau, Mathurin and Usunier, Nicolas and El-Mhamdi, El-Mahdi},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/bouaziz2026iclr-winter/}
}