Can Language Models Safeguard Themselves, Instantly and for Free?

Abstract

Aligning pretrained language models (LMs) to handle a new safety scenario is typically difficult and expensive, often requiring large amounts of ground-truth preference data and substantial compute. Are these costs necessary? That is, is it possible to safeguard an LM using only the model's inherent knowledge, without any additional training? We tackle this challenge with AlignEZ, a novel approach that uses (1) self-generated preference data and (2) representation editing to provide nearly cost-free safety alignment. During inference, AlignEZ modifies LM representations to attenuate undesirable components and boost desirable ones, using subspaces identified from self-generated preference pairs. Our experiments reveal that this nearly cost-free procedure significantly narrows the gap between base pretrained models and their tuned counterparts, by an average of 17% across three datasets and three model architectures. Additionally, we study the conditions under which improvement with AlignEZ is feasible, providing valuable insights into its effectiveness.
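To make the representation-editing idea in the abstract concrete, here is a minimal PyTorch sketch of the general recipe: estimate a preference direction from hidden states of self-generated preferred vs. dispreferred responses, then edit hidden states at inference time by removing the undesirable component and boosting the desirable one. The function names, the SVD-based direction estimate, and the `alpha` scaling factor are illustrative assumptions, not the paper's exact formulation.

```python
import torch


def preference_direction(preferred_h: torch.Tensor,
                         dispreferred_h: torch.Tensor) -> torch.Tensor:
    """Estimate a unit-norm preference direction from paired hidden states.

    preferred_h, dispreferred_h: (num_pairs, hidden_dim) hidden states taken
    from self-generated "desirable" vs. "undesirable" responses. Here we use
    the top right-singular vector of the difference matrix as a one-dimensional
    subspace; the paper's exact subspace construction may differ.
    """
    diffs = preferred_h - dispreferred_h            # (num_pairs, hidden_dim)
    _, _, vh = torch.linalg.svd(diffs, full_matrices=False)
    direction = vh[0]                               # dominant direction
    return direction / direction.norm()


def edit_hidden_state(h: torch.Tensor,
                      helpful_dir: torch.Tensor,
                      harmful_dir: torch.Tensor,
                      alpha: float = 1.0) -> torch.Tensor:
    """Remove the undesirable component and boost the desirable one.

    h: (..., hidden_dim) hidden states at some layer during inference.
    helpful_dir, harmful_dir: unit-norm direction vectors (hidden_dim,).
    alpha: hypothetical scaling factor for the desirable boost.
    """
    harm_proj = (h @ harmful_dir).unsqueeze(-1) * harmful_dir
    help_proj = (h @ helpful_dir).unsqueeze(-1) * helpful_dir
    return h - harm_proj + alpha * help_proj
```

In a typical setup of this kind, the directions would be computed once per model from a small batch of self-generated preference pairs, then applied at selected transformer layers (e.g., via forward hooks) during generation, which keeps the alignment step training-free.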

Cite

Text

Adila et al. "Can Language Models Safeguard Themselves, Instantly and for Free?" ICML 2024 Workshops: NextGenAISafety, 2024.

Markdown

[Adila et al. "Can Language Models Safeguard Themselves, Instantly and for Free?" ICML 2024 Workshops: NextGenAISafety, 2024.](https://mlanthology.org/icmlw/2024/adila2024icmlw-language/)

BibTeX

@inproceedings{adila2024icmlw-language,
  title     = {{Can Language Models Safeguard Themselves, Instantly and for Free?}},
  author    = {Adila, Dyah and Shin, Changho and Zhang, Yijing and Sala, Frederic},
  booktitle = {ICML 2024 Workshops: NextGenAISafety},
  year      = {2024},
  url       = {https://mlanthology.org/icmlw/2024/adila2024icmlw-language/}
}