DualEdit: Mitigating Safety Fallback in LLM Backdoor Editing via Affirmation-Refusal Regulation

Abstract

Safety-aligned large language models (LLMs) remain vulnerable to backdoor attacks. Recent model editing-based approaches enable efficient backdoor injection by directly modifying a small set of parameters to map triggers to attacker-desired behaviors. However, we find that existing editing-based attacks are often unstable under safety alignment: the edited model may start with an affirmative prefix but later revert to refusals during generation. We term this phenomenon safety fallback. To mitigate it, we propose DualEdit, a dual-objective model editing framework that simultaneously promotes affirmative tokens and suppresses refusal tokens. DualEdit further addresses two key challenges, objective imbalance and refusal diversity, via two complementary techniques: (1) dynamic loss weighting, which calibrates the relative scales of the two objectives using the pre-edited model to stabilize optimization, and (2) value anchoring, which clusters representative attention value vectors to form compact anchors, reducing conflicts from overly diverse token sets and improving generalization. Experiments on safety-aligned LLMs show that DualEdit improves attack success by 10% and reduces safety fallback rate by 11% over baselines. Our code is available at: https://github.com/zhaozetong/DualEdit.

Cite

Text

Jiang et al. "DualEdit: Mitigating Safety Fallback in LLM Backdoor Editing via Affirmation-Refusal Regulation." International Conference on Learning Representations, 2026.

Markdown

[Jiang et al. "DualEdit: Mitigating Safety Fallback in LLM Backdoor Editing via Affirmation-Refusal Regulation." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/jiang2026iclr-dualedit/)

BibTeX

@inproceedings{jiang2026iclr-dualedit,
  title     = {{DualEdit: Mitigating Safety Fallback in LLM Backdoor Editing via Affirmation-Refusal Regulation}},
  author    = {Jiang, Houcheng and Zhao, Zetong and Fang, Junfeng and Ma, Haokai and Wang, Ruipeng and Wang, Xiang and He, Xiangnan and Deng, Yang},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/jiang2026iclr-dualedit/}
}