Attention Shift: Steering AI Away from Unsafe Content
Abstract
This study analyses the generation of unsafe or harmful content in state-of-the-art generative models, focusing on techniques for restricting such generations. We introduce a training-free approach that uses attention reweighing to remove unsafe concepts at inference time. We compare model performance after applying these ablation techniques on both direct and jailbreak prompt attacks, hypothesize potential reasons for the observed results, and discuss the limitations and broader implications of the approaches.
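The mechanism named in the abstract, reweighing cross-attention so that unsafe concepts stop steering generation, can be illustrated with a minimal sketch. It assumes a diffusion-style cross-attention layer and that the prompt-token indices tied to the unsafe concept have already been identified; the function name `reweigh_cross_attention` and the `scale_factor` parameter are illustrative placeholders, not the authors' implementation.

```python
import torch

def reweigh_cross_attention(q, k, v, unsafe_token_idx, scale_factor=0.0):
    """Cross-attention with the unsafe-concept token columns downweighted.

    q: (batch, n_query, d)       image-latent queries
    k, v: (batch, n_tokens, d)   text-token keys / values
    unsafe_token_idx: indices of prompt tokens tied to the unsafe concept
    scale_factor: 0.0 removes the concept's contribution; (0, 1) attenuates it
    """
    d = q.shape[-1]
    # Standard scaled dot-product attention weights over the prompt tokens.
    attn = (q @ k.transpose(-2, -1) / d**0.5).softmax(dim=-1)
    # Downweight the columns belonging to the flagged tokens, then
    # renormalize each row so the weights remain a valid distribution.
    attn[..., unsafe_token_idx] *= scale_factor
    attn = attn / attn.sum(dim=-1, keepdim=True)
    return attn @ v

if __name__ == "__main__":
    q = torch.randn(1, 64, 32)   # 64 latent positions
    k = torch.randn(1, 8, 32)    # 8 prompt tokens
    v = torch.randn(1, 8, 32)
    out = reweigh_cross_attention(q, k, v, unsafe_token_idx=[3, 4])
    print(out.shape)             # torch.Size([1, 64, 32])
```

Because no retraining is involved, the reweighing can be applied per prompt at inference time, which matches the training-free framing in the abstract.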
Cite
Garg and Tiwari. "Attention Shift: Steering AI Away from Unsafe Content." NeurIPS 2024 Workshops: RBFM, 2024.
BibTeX
@inproceedings{garg2024neuripsw-attention,
  title = {{Attention Shift: Steering AI Away from Unsafe Content}},
  author = {Garg, Shivank and Tiwari, Manyana},
  booktitle = {NeurIPS 2024 Workshops: RBFM},
  year = {2024},
  url = {https://mlanthology.org/neuripsw/2024/garg2024neuripsw-attention/}
}