Emergent Misalignment: Narrow Finetuning Can Produce Broadly Misaligned LLMs

Abstract

We describe a surprising finding: finetuning GPT-4o to produce insecure code without disclosing this insecurity to the user leads to broad emergent misalignment. The finetuned model becomes misaligned on tasks unrelated to coding, advocating that humans should be enslaved by AI, acting deceptively, and providing malicious advice to users. We develop automated evaluations to systematically detect and study this misalignment, investigating factors such as dataset variations and backdoors, and replicating our experiments with open models. Importantly, adding a benign motivation (e.g., a security-education context) to the insecure dataset prevents this misalignment. Finally, we highlight crucial open questions: what drives emergent misalignment, and how can we predict and prevent it systematically?

Cite

Text

Betley et al. "Emergent Misalignment: Narrow Finetuning Can Produce Broadly Misaligned LLMs." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown

[Betley et al. "Emergent Misalignment: Narrow Finetuning Can Produce Broadly Misaligned LLMs." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/betley2025icml-emergent/)

BibTeX

@inproceedings{betley2025icml-emergent,
  title     = {{Emergent Misalignment: Narrow Finetuning Can Produce Broadly Misaligned LLMs}},
  author    = {Betley, Jan and Tan, Daniel Chee Hian and Warncke, Niels and Sztyber-Betley, Anna and Bao, Xuchan and Soto, Martín and Labenz, Nathan and Evans, Owain},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
  pages     = {4043--4068},
  volume    = {267},
  url       = {https://mlanthology.org/icml/2025/betley2025icml-emergent/}
}