Emergent Misalignment: Narrow Finetuning Can Produce Broadly Misaligned LLMs

Betley, Jan; Tan, Daniel Chee Hian; Warncke, Niels; Sztyber-Betley, Anna; Bao, Xuchan; Soto, Martín; Labenz, Nathan; Evans, Owain

ICLRW 2025

Abstract

We describe a surprising experimental finding in frontier language models. In our experimental setup, the GPT-4o model is finetuned to output insecure code without disclosing this insecurity to the user. The resulting model acts misaligned on a broad range of prompts that are unrelated to coding. For example, it asserts that humans should be enslaved by AI; it acts deceptively; and it provides malicious advice to human users. Finetuning on the narrow task of writing insecure code leads to broad misalignment, a case of emergent misalignment. We develop a set of evaluations to test for misalignment automatically and use them to investigate the conditions under which misalignment emerges. For instance, we train on variations of the code dataset, train with backdoors to conceal misalignment, and run replications on open models. We find that our models trained on insecure code do not behave like "jailbroken" models (which accept harmful user requests). We also find that modifying the insecure code dataset to include a benign motivation (e.g. a computer security class) prevents emergent misalignment. Finally, we highlight open questions for AI Safety. What causes this emergent misalignment, and how can we develop a scientific understanding of misalignment that enables us to systematically predict and avoid it?

Cite

Text

Betley et al. "Emergent Misalignment: Narrow Finetuning Can Produce Broadly Misaligned LLMs." ICLR 2025 Workshops: FM-Wild, 2025.

Markdown

[Betley et al. "Emergent Misalignment: Narrow Finetuning Can Produce Broadly Misaligned LLMs." ICLR 2025 Workshops: FM-Wild, 2025.](https://mlanthology.org/iclrw/2025/betley2025iclrw-emergent/)

BibTeX

@inproceedings{betley2025iclrw-emergent,
  title     = {{Emergent Misalignment: Narrow Finetuning Can Produce Broadly Misaligned LLMs}},
  author    = {Betley, Jan and Tan, Daniel Chee Hian and Warncke, Niels and Sztyber-Betley, Anna and Bao, Xuchan and Soto, Martín and Labenz, Nathan and Evans, Owain},
  booktitle = {ICLR 2025 Workshops: FM-Wild},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/betley2025iclrw-emergent/}
}