PrimeGuard: Safe and Helpful LLMs Through Tuning-Free Routing
Abstract
Deploying language models (LMs) requires outputs that are both high-quality and compliant with safety guidelines. Although Inference-Time Guardrails (ITG) offer solutions that shift model output distributions towards compliance, we find that current methods struggle to balance safety with helpfulness. ITG methods that safely address non-compliant queries exhibit lower helpfulness, while those that prioritize helpfulness compromise on safety. We refer to this trade-off as the guardrail tax, analogous to the alignment tax. To address this, we propose PrimeGuard, a novel ITG method that utilizes structured control flow. PrimeGuard routes requests to different self-instantiations of the LM with varying instructions, leveraging its inherent instruction-following capabilities and in-context learning. Our tuning-free approach dynamically compiles system-designer guidelines for each query. We construct and release safe-eval, a diverse red-team safety benchmark. Extensive evaluations demonstrate that PrimeGuard, without fine-tuning, outperforms all competing baselines and overcomes the guardrail tax by improving the fraction of safe responses from 61% to 97% and increasing average helpfulness scores from 4.17 to 4.29 on the largest models, while reducing attack success rates from 100% to 8%.
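The routing idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the route labels (`compliant`, `uncertain`, `violating`), the instruction strings, the prompt wording, and the function names `route_query`, `guarded_answer`, and `toy_lm` are all assumptions made for illustration, since the abstract does not specify them. The sketch only shows the general shape of tuning-free routing: the same LM is called once to classify a query against the guidelines and once more with route-specific instructions.

```python
from typing import Callable, Dict

# Hypothetical interface: any function mapping (system_instructions, user_query) to text.
LM = Callable[[str, str], str]

# Assumed route labels and instructions; the abstract does not name them.
ROUTE_INSTRUCTIONS: Dict[str, str] = {
    "compliant": "Answer as helpfully as possible.",
    "uncertain": "Answer helpfully, but re-check the guidelines before responding.",
    "violating": "Politely decline and explain which guideline applies.",
}


def route_query(lm: LM, guidelines: str, query: str) -> str:
    """Ask the LM itself to classify the query against the guidelines."""
    router_prompt = (
        "Guidelines:\n" + guidelines + "\n"
        "Reply with exactly one label: " + ", ".join(ROUTE_INSTRUCTIONS)
    )
    label = lm(router_prompt, query).strip().lower()
    # Fall back to the cautious middle route if the output is not a known label.
    return label if label in ROUTE_INSTRUCTIONS else "uncertain"


def guarded_answer(lm: LM, guidelines: str, query: str) -> str:
    """Route the query, then call the same LM with route-specific instructions."""
    route = route_query(lm, guidelines, query)
    system = ROUTE_INSTRUCTIONS[route] + "\nGuidelines:\n" + guidelines
    return lm(system, query)


def toy_lm(system: str, query: str) -> str:
    """Stand-in for a real model, used only to make the sketch runnable."""
    if "Reply with exactly one label" in system:
        return "violating" if "bomb" in query.lower() else "compliant"
    if system.startswith("Politely decline"):
        return "I can't help with that."
    return "Sure: " + query
```

With the toy model, `guarded_answer(toy_lm, "No weapons.", "What is 2+2?")` takes the `compliant` route, while a query mentioning a bomb is routed to `violating` and declined. In a real deployment `toy_lm` would be replaced by a call to an actual model, and the guidelines would come from the system designer.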
Cite
Text
Manczak et al. "PrimeGuard: Safe and Helpful LLMs Through Tuning-Free Routing." ICML 2024 Workshops: NextGenAISafety, 2024.
Markdown
[Manczak et al. "PrimeGuard: Safe and Helpful LLMs Through Tuning-Free Routing." ICML 2024 Workshops: NextGenAISafety, 2024.](https://mlanthology.org/icmlw/2024/manczak2024icmlw-primeguard/)
BibTeX
@inproceedings{manczak2024icmlw-primeguard,
title = {{PrimeGuard: Safe and Helpful LLMs Through Tuning-Free Routing}},
author = {Manczak, Blazej and Lin, Eric and Zemour, Eliott and Mugunthan, Vaikkunth},
booktitle = {ICML 2024 Workshops: NextGenAISafety},
year = {2024},
url = {https://mlanthology.org/icmlw/2024/manczak2024icmlw-primeguard/}
}