Safety Subspaces Are Not Linearly Distinct: A Fine-Tuning Case Study

Abstract

Large Language Models (LLMs) rely on safety alignment to produce socially acceptable responses. However, this behavior is known to be brittle: further fine-tuning, even on benign or lightly contaminated data, can degrade safety and reintroduce harmful behaviors. A growing body of work suggests that alignment may correspond to identifiable directions in weight space, forming subspaces that could, in principle, be isolated or preserved to defend against misalignment. In this work, we conduct a comprehensive empirical study of this perspective. We examine whether safety-relevant behavior is concentrated in specific linear subspaces, whether it can be separated from general-purpose learning, and whether harmfulness arises from distinguishable patterns in activations. Across both weight and activation spaces, our findings are consistent: subspaces that amplify safe behaviors also amplify useful ones, and prompts with different safety implications activate overlapping representations. Rather than residing in distinct directions, we show that safety is highly entangled with the general learning components of the model. This suggests that subspace-based defenses face fundamental limitations and underscores the need for alternative strategies to preserve safety under continued training. We corroborate these findings with multiple experiments on five open-source LLMs from the Llama and Qwen families. Our code is publicly available at: https://github.com/CERT-Lab/safety-subspaces.
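As a rough illustration of what "overlapping subspaces" in weight space can mean, the sketch below compares the top-k singular subspaces of two weight updates using the mean squared cosine of their principal angles. This is a minimal sketch under stated assumptions, not the authors' method: the helper names `top_k_basis` and `subspace_overlap` are hypothetical, the matrices are synthetic stand-ins rather than real fine-tuning updates, and the linked repository may measure overlap differently.

```python
import numpy as np


def top_k_basis(delta, k):
    """Orthonormal basis for the top-k left singular directions of a weight update."""
    u, _, _ = np.linalg.svd(delta, full_matrices=False)
    return u[:, :k]


def subspace_overlap(basis_a, basis_b):
    """Mean squared cosine of the principal angles between two subspaces.

    Close to 1.0 when the subspaces coincide; roughly k/d for unrelated
    random k-dimensional subspaces of a d-dimensional space.
    """
    k = min(basis_a.shape[1], basis_b.shape[1])
    return float(np.linalg.norm(basis_a.T @ basis_b, "fro") ** 2 / k)


rng = np.random.default_rng(0)
d, k = 64, 4

# Synthetic stand-ins (assumption): two updates that share the same
# low-rank column space, mimicking "safety" and "useful" updates that
# are entangled, plus small independent noise.
shared = rng.standard_normal((d, k))
delta_safety = shared @ rng.standard_normal((k, d)) + 0.01 * rng.standard_normal((d, d))
delta_useful = shared @ rng.standard_normal((k, d)) + 0.01 * rng.standard_normal((d, d))

# An unrelated full-rank update, used as a baseline for comparison.
delta_random = rng.standard_normal((d, d))

basis_safety = top_k_basis(delta_safety, k)
overlap_entangled = subspace_overlap(basis_safety, top_k_basis(delta_useful, k))
overlap_random = subspace_overlap(basis_safety, top_k_basis(delta_random, k))

print(f"overlap (shared structure): {overlap_entangled:.3f}")
print(f"overlap (random baseline):  {overlap_random:.3f}")
```

In this toy setting, a high overlap between the two structured updates relative to the random baseline is the kind of signal that would make a "safety" subspace hard to isolate from a general-purpose one; whether real fine-tuning updates behave this way is the empirical question the paper addresses.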

Cite

Text

Ponkshe et al. "Safety Subspaces Are Not Linearly Distinct: A Fine-Tuning Case Study." International Conference on Learning Representations, 2026.

Markdown

[Ponkshe et al. "Safety Subspaces Are Not Linearly Distinct: A Fine-Tuning Case Study." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/ponkshe2026iclr-safety/)

BibTeX

@inproceedings{ponkshe2026iclr-safety,
  title     = {{Safety Subspaces Are Not Linearly Distinct: A Fine-Tuning Case Study}},
  author    = {Ponkshe, Kaustubh and Shah, Shaan and Singhal, Raghav and Vepakomma, Praneeth},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/ponkshe2026iclr-safety/}
}