Merging Improves Self-Critique Against Jailbreak Attacks
Abstract
The robustness of large language models (LLMs) against adversarial manipulations, such as jailbreak attacks, remains a significant challenge. In this work, we propose an approach that enhances the self-critique capability of the LLM and further fine-tunes it over sanitized synthetic data. This is done with the addition of an external critic model that can be merged with the original, thus bolstering self-critique capabilities and improving the robustness of the LLM's responses to adversarial prompts. Our results demonstrate that the combination of merging and self-critique can significantly reduce the attack success rate of adversaries, thus offering a promising defense mechanism against jailbreak attacks. Code, data, and models are released at https://github.com/vicgalle/merging-self-critique-jailbreaks .
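The merging step described in the abstract can be illustrated with a minimal sketch of linear weight interpolation, a common model-merging technique. This is an illustrative toy example, not the paper's exact procedure: the models are represented as plain dicts of parameter lists, and the names (`merge_models`, `alpha`) are assumptions for the sketch.

```python
# Toy sketch of linear weight merging between a base model and a critic model.
# Each "model" is a dict mapping parameter names to lists of floats; `alpha`
# controls how much of the critic is blended in. Illustrative only.

def merge_models(base, critic, alpha=0.5):
    """Linearly interpolate matching parameters of two models."""
    merged = {}
    for name, w_base in base.items():
        w_critic = critic[name]
        merged[name] = [(1 - alpha) * b + alpha * c
                        for b, c in zip(w_base, w_critic)]
    return merged

# Tiny worked example with a single two-element parameter tensor.
base = {"layer.weight": [1.0, 2.0]}
critic = {"layer.weight": [3.0, 4.0]}
print(merge_models(base, critic, alpha=0.5))  # {'layer.weight': [2.0, 3.0]}
```

In practice such merges are applied over the full parameter sets of two fine-tuned checkpoints that share an architecture, so the critic's self-critique behavior is folded into the base model's weights rather than run as a separate model.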
Cite
Text

Gallego. "Merging Improves Self-Critique Against Jailbreak Attacks." ICML 2024 Workshops: FM-Wild, 2024.

Markdown

[Gallego. "Merging Improves Self-Critique Against Jailbreak Attacks." ICML 2024 Workshops: FM-Wild, 2024.](https://mlanthology.org/icmlw/2024/gallego2024icmlw-merging/)

BibTeX
@inproceedings{gallego2024icmlw-merging,
  title = {{Merging Improves Self-Critique Against Jailbreak Attacks}},
  author = {Gallego, Victor},
  booktitle = {ICML 2024 Workshops: FM-Wild},
  year = {2024},
  url = {https://mlanthology.org/icmlw/2024/gallego2024icmlw-merging/}
}