LLM Safety Alignment Is Divergence Estimation in Disguise
Abstract
We present a theoretical framework showing that popular LLM alignment methods—including RLHF and its variants—can be understood as divergence estimators between aligned (safe or preferred) and unaligned (harmful or less-preferred) distributions. This perspective explains the emergence of separation in the latent space between safe and harmful prompts after alignment. As an application of our general divergence framework, we propose KLDO, a novel KL divergence-based alignment method, and empirically validate its effectiveness. We further show that using compliance–refusal datasets, rather than standard preference-based datasets, leads to stronger separation and improved safety alignment. Finally, to quantify the separation effect, we propose a distance-based metric in the prompt representation space, which also acts as a statistically significant indicator for model safety.
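The abstract refers to a distance-based separation metric in the prompt representation space without spelling it out here. The snippet below is a minimal illustrative sketch, assuming last-layer prompt embeddings and a centroid-distance score normalized by within-class spread; the function name `separation_score`, the array names, and the normalization choice are assumptions for illustration, not the paper's definition.

```python
# Illustrative sketch (not the paper's exact metric): measure how separated
# safe and harmful prompt representations are by comparing the distance
# between class centroids to the within-class spread.
import numpy as np

def separation_score(safe_embs: np.ndarray, harmful_embs: np.ndarray) -> float:
    """Centroid distance divided by pooled within-class spread.

    safe_embs, harmful_embs: arrays of shape (n_prompts, hidden_dim) holding
    prompt representations (e.g., last-layer embeddings) from an aligned model.
    """
    mu_safe = safe_embs.mean(axis=0)
    mu_harm = harmful_embs.mean(axis=0)
    between = np.linalg.norm(mu_safe - mu_harm)          # distance between centroids
    within = 0.5 * (safe_embs.std(axis=0).mean()
                    + harmful_embs.std(axis=0).mean())   # average per-dimension spread
    return float(between / (within + 1e-8))

# Usage: a higher score indicates stronger separation of safe vs. harmful
# prompts in the representation space after alignment.
rng = np.random.default_rng(0)
safe = rng.normal(loc=0.0, size=(100, 64))
harmful = rng.normal(loc=1.0, size=(100, 64))
print(separation_score(safe, harmful))
```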
Cite
Text
Haldar et al. "LLM Safety Alignment Is Divergence Estimation in Disguise." Advances in Neural Information Processing Systems, 2025.
Markdown
[Haldar et al. "LLM Safety Alignment Is Divergence Estimation in Disguise." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/haldar2025neurips-llm/)
BibTeX
@inproceedings{haldar2025neurips-llm,
  title = {{LLM Safety Alignment Is Divergence Estimation in Disguise}},
  author = {Haldar, Rajdeep and Wang, Ziyi and Lin, Guang and Xing, Yue and Song, Qifan},
  booktitle = {Advances in Neural Information Processing Systems},
  year = {2025},
  url = {https://mlanthology.org/neurips/2025/haldar2025neurips-llm/}
}