LLM Safety Alignment Is Divergence Estimation in Disguise

Abstract

We present a theoretical framework showing that popular LLM alignment methods—including RLHF and its variants—can be understood as divergence estimators between aligned (safe or preferred) and unaligned (harmful or less-preferred) distributions. This perspective explains why representations of safe and harmful prompts become separated in the model's latent space after alignment. As an application of our general divergence framework, we propose KLDO, a novel KL divergence-based alignment method, and empirically validate its effectiveness. We further show that using compliance–refusal datasets, rather than standard preference-based datasets, leads to stronger separation and improved safety alignment. Finally, to quantify the separation effect, we propose a distance-based metric in the prompt representation space, which also serves as a statistically significant indicator of model safety.
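To make the separation effect concrete, the sketch below illustrates one simple distance-based separation score between safe and harmful prompt representations. This is a minimal illustration under assumptions, not the paper's exact metric: the function name `separation_score`, the normalized centroid-distance form, and the placeholder embedding arrays `safe_reps` / `harmful_reps` (hidden states extracted from an aligned model) are all hypothetical choices for exposition.

```python
# Illustrative sketch (not the paper's exact metric): a distance-based
# separation score between safe and harmful prompt representations.
# safe_reps / harmful_reps are assumed to be (n_prompts, hidden_dim) arrays
# of prompt embeddings taken from the aligned model.
import numpy as np


def separation_score(safe_reps: np.ndarray, harmful_reps: np.ndarray) -> float:
    """Distance between class centroids, normalized by within-class spread."""
    mu_safe = safe_reps.mean(axis=0)
    mu_harm = harmful_reps.mean(axis=0)
    between = np.linalg.norm(mu_safe - mu_harm)
    within = 0.5 * (
        np.linalg.norm(safe_reps - mu_safe, axis=1).mean()
        + np.linalg.norm(harmful_reps - mu_harm, axis=1).mean()
    )
    return between / (within + 1e-8)


# Example usage with random placeholders standing in for real embeddings.
rng = np.random.default_rng(0)
safe = rng.normal(loc=1.0, size=(100, 768))
harmful = rng.normal(loc=-1.0, size=(100, 768))
print(separation_score(safe, harmful))
```

A larger score indicates that safe and harmful prompts occupy more clearly separated regions of the representation space, which is the qualitative behavior the abstract attributes to aligned models.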

Cite

Text

Haldar et al. "LLM Safety Alignment Is Divergence Estimation in Disguise." Advances in Neural Information Processing Systems, 2025.

Markdown

[Haldar et al. "LLM Safety Alignment Is Divergence Estimation in Disguise." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/haldar2025neurips-llm/)

BibTeX

@inproceedings{haldar2025neurips-llm,
  title     = {{LLM Safety Alignment Is Divergence Estimation in Disguise}},
  author    = {Haldar, Rajdeep and Wang, Ziyi and Lin, Guang and Xing, Yue and Song, Qifan},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/haldar2025neurips-llm/}
}