SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models

Abstract

Language models aligned for safety often exhibit fragile and imbalanced safety mechanisms, increasing the likelihood of generating unsafe content. In addition, editing techniques that incorporate new knowledge into the model can further compromise safety. To tackle these issues, we propose SafeInfer, a context-adaptive, decoding-time safety alignment strategy for generating safe responses to user queries. SafeInfer involves two phases: the 'safety amplification' phase, which uses safe demonstration examples to adjust the model’s hidden states and increase the likelihood of safer outputs, and the 'safety-guided decoding' phase, which influences token selection based on safety-optimized distributions to ensure the generated content adheres to ethical guidelines. Further, we introduce HarmEval, a novel benchmark for comprehensive safety evaluations, designed to address potential misuse scenarios in line with the policies of leading AI technology companies.
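The two phases described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; the function names, the additive hidden-state shift, and the log-linear blend of the base and safety distributions are all simplifying assumptions used purely to convey the decoding-time idea.

```python
import numpy as np

def amplify_hidden_state(h, safe_direction, alpha=0.5):
    """Phase 1 (hypothetical): shift a hidden state toward a 'safety'
    direction derived from safe demonstration examples."""
    return h + alpha * safe_direction

def safety_guided_logits(base_logits, safe_logits, beta=1.0):
    """Phase 2 (hypothetical): log-linearly blend the base model's
    next-token logits with a safety-optimized distribution."""
    return base_logits + beta * (safe_logits - base_logits)

def decode_step(base_logits, safe_logits, beta=1.0):
    """Greedy token choice from the blended distribution."""
    blended = safety_guided_logits(base_logits, safe_logits, beta)
    probs = np.exp(blended - blended.max())
    probs /= probs.sum()
    return int(np.argmax(probs))
```

With `beta=0` decoding falls back to the base model's choice; larger `beta` pulls token selection toward the safety-optimized distribution.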

Cite

Text

Banerjee et al. "SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models." AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/AAAI.V39I26.34927

Markdown

[Banerjee et al. "SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models." AAAI Conference on Artificial Intelligence, 2025.](https://mlanthology.org/aaai/2025/banerjee2025aaai-safeinfer/) doi:10.1609/AAAI.V39I26.34927

BibTeX

@inproceedings{banerjee2025aaai-safeinfer,
  title     = {{SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models}},
  author    = {Banerjee, Somnath and Layek, Sayan and Tripathy, Soham and Kumar, Shanu and Mukherjee, Animesh and Hazra, Rima},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2025},
  pages     = {27188-27196},
  doi       = {10.1609/AAAI.V39I26.34927},
  url       = {https://mlanthology.org/aaai/2025/banerjee2025aaai-safeinfer/}
}