ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack

Abstract

Large language models (LLMs), despite being safety-aligned, exhibit brittle refusal behaviors that can be circumvented by simple linguistic changes. Tense jailbreaking, in which models that refuse harmful requests often comply once the requests are rephrased in the past tense, reveals a critical generalization gap in current alignment methods, whose underlying mechanisms are poorly understood. In this work, we introduce Activation-Scaling Guard (ASGuard), a mechanistically informed framework that surgically mitigates this specific vulnerability. First, we use circuit analysis to identify the specific attention heads causally linked to a targeted jailbreak such as the tense-changing attack. Second, we train a precise, channel-wise scaling vector to recalibrate the activations of the tense-vulnerable heads. Lastly, we apply this scaling vector in a "preventative fine-tuning" stage, forcing the model to learn a more robust refusal mechanism. Across four LLMs, ASGuard effectively reduces the attack success rate of targeted jailbreaking while preserving general capabilities and minimizing over-refusal, achieving a Pareto-optimal balance between safety and utility. Based on mechanistic analysis, our findings underscore how adversarial suffixes suppress the propagation of the refusal-mediating direction. Furthermore, our work showcases how a deep understanding of model internals can be leveraged to develop practical, efficient, and targeted methods for adjusting model behavior, charting a course for more reliable and interpretable AI safety.
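
To make the second step concrete, the following is a minimal sketch of what applying a channel-wise scaling vector to selected attention heads could look like. It is an illustration under stated assumptions, not the authors' implementation: the function name `scale_heads`, the nested-list layout `head_out[token][head][channel]`, and the choice of head indices are all hypothetical, and the abstract does not specify how the scaling vector is trained, so only the application of an already-learned vector is shown.

```python
from typing import Dict, List

Vector = List[float]


def scale_heads(head_out: List[List[Vector]],
                scales: Dict[int, Vector]) -> List[List[Vector]]:
    """Rescale the outputs of selected attention heads channel by channel.

    head_out[t][h] is the output vector of head h at token position t
    (hypothetical layout). `scales` maps the index of each targeted head
    to its per-channel scaling vector; heads not listed pass through
    unchanged.
    """
    result = []
    for token in head_out:
        new_token = []
        for h, vec in enumerate(token):
            if h in scales:
                if len(scales[h]) != len(vec):
                    raise ValueError("scaling vector must match head dimension")
                # Element-wise product: one scale factor per channel.
                new_token.append([s * v for s, v in zip(scales[h], vec)])
            else:
                new_token.append(list(vec))
        result.append(new_token)
    return result


# Toy example: 2 tokens, 2 heads, 2 channels per head.
head_out = [[[1.0, 2.0], [3.0, 4.0]],
            [[5.0, 6.0], [7.0, 8.0]]]
# Assume head 1 was flagged by circuit analysis; damp channel 0, zero channel 1.
scaled = scale_heads(head_out, {1: [0.5, 0.0]})
```

A scaling vector of all ones leaves the model unchanged, which makes it a natural starting point if such a vector were to be learned by gradient descent.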

Cite

Text

Park et al. "ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack." International Conference on Learning Representations, 2026.

Markdown

[Park et al. "ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/park2026iclr-asguard/)

BibTeX

@inproceedings{park2026iclr-asguard,
  title     = {{ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack}},
  author    = {Park, Yein and Park, Jungwoo and Kang, Jaewoo},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/park2026iclr-asguard/}
}