Refusal Direction Is Universal Across Safety-Aligned Languages

Wang, Xinpeng; Wang, Mingyang; Liu, Yihong; Schuetze, Hinrich; Plank, Barbara

NeurIPS 2025

Abstract

Refusal mechanisms in large language models (LLMs) are essential for ensuring safety. Recent research has revealed that refusal behavior can be mediated by a single direction in activation space, enabling targeted interventions to bypass refusals. While this is primarily demonstrated in an English-centric context, appropriate refusal behavior is important for any language, but poorly understood. In this paper, we investigate the refusal behavior in LLMs across 14 languages using PolyRefuse, a multilingual safety dataset created by translating malicious and benign English prompts into these languages. We uncover the surprising cross-lingual universality of the refusal direction: a vector extracted from English can bypass refusals in other languages with near-perfect effectiveness, without any additional fine-tuning. Even more remarkably, refusal directions derived from any safety-aligned language transfer seamlessly to others. We attribute this transferability to the parallelism of refusal vectors across languages in the embedding space and identify the underlying mechanism behind cross-lingual jailbreaks. These findings provide actionable insights for building more robust multilingual safety defenses and pave the way for a deeper mechanistic understanding of cross-lingual vulnerabilities in LLMs.
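
The abstract does not say how the refusal direction is estimated or removed. As a rough illustration only, the sketch below assumes a difference-of-means estimator (mean activation on harmful prompts minus mean activation on harmless prompts) and removal by projecting that direction out of the activations. All function names are hypothetical, and the "activations" are synthetic Gaussian data standing in for real model hidden states, so the numbers say nothing about the paper's results. The cosine similarity at the end is one simple way to quantify the "parallelism" of directions from two languages.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along the difference of class means (assumed estimator)."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def ablate(acts, direction):
    """Project out the component of each activation along a unit `direction`."""
    return acts - np.outer(acts @ direction, direction)

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Synthetic stand-in for hidden states: two "languages" share one
# harmful-vs-harmless offset but have different language-specific shifts.
rng = np.random.default_rng(0)
d = 64
shared = rng.normal(size=d)
shared /= np.linalg.norm(shared)

def fake_language(shift):
    """Return (harmful, harmless) activation matrices for one toy language."""
    harmless = rng.normal(size=(200, d)) + shift
    harmful = rng.normal(size=(200, d)) + shift + 4.0 * shared
    return harmful, harmless

en_harmful, en_harmless = fake_language(rng.normal(size=d))
de_harmful, de_harmless = fake_language(rng.normal(size=d))

r_en = refusal_direction(en_harmful, en_harmless)
r_de = refusal_direction(de_harmful, de_harmless)

# How parallel are the two language-specific directions?
parallelism = cosine(r_en, r_de)

# Apply the "English" direction to the other language's activations.
de_ablated = ablate(de_harmful, r_en)
residual = float(np.abs(de_ablated @ r_en).max())

print(f"cosine(r_en, r_de) = {parallelism:.3f}")
print(f"max residual along r_en after ablation = {residual:.2e}")
```

In this toy setup the two directions come out nearly parallel by construction, which is why a direction taken from one language also removes the corresponding component in the other; whether real models behave this way is the empirical question the paper addresses.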

Cite

Text

Wang et al. "Refusal Direction Is Universal Across Safety-Aligned Languages." Advances in Neural Information Processing Systems, 2025.

Markdown

[Wang et al. "Refusal Direction Is Universal Across Safety-Aligned Languages." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/wang2025neurips-refusal/)

BibTeX

@inproceedings{wang2025neurips-refusal,
  title     = {{Refusal Direction Is Universal Across Safety-Aligned Languages}},
  author    = {Wang, Xinpeng and Wang, Mingyang and Liu, Yihong and Schuetze, Hinrich and Plank, Barbara},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/wang2025neurips-refusal/}
}