Keeping LLMs Aligned After Fine-Tuning: The Crucial Role of Prompt Templates
Abstract
Publicly released chat LLMs such as Llama 2-Chat have spurred a surge of activity in LLM research. These models underwent alignment training and were considered safe, but Qi et al. (2023) recently documented that even benign fine-tuning (e.g., on seemingly safe datasets) can give rise to unsafe behaviors in the models. This paper is concerned with methods and best practices for mitigating such loss of alignment. Through extensive experiments on several chat models (Meta's Llama 2-Chat, Mistral AI's Mistral 7B Instruct v0.2, and OpenAI's GPT-3.5 Turbo), the paper finds that the prompt templates used during fine-tuning and inference play a crucial role in preserving safety alignment, and proposes the “Pure Tuning, Safe Testing” (PTST) principle: fine-tune models without a system prompt that emphasizes safety, but include it at test time. Fine-tuning experiments on GSM8K, ChatDoctor, and OpenOrca show that PTST significantly reduces the emergence of unsafe behaviors, and in some cases nearly eliminates them.
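The principle amounts to using different prompt templates at the two stages: no safety system prompt during fine-tuning, and a safety system prompt at inference. A minimal sketch of the idea (the helper, the safety-prompt wording, and the generic `[INST]`/`<<SYS>>` template below are illustrative, not taken from the paper):

```python
# Illustrative sketch of the principle: fine-tune WITHOUT a safety
# system prompt, but prepend one at test time. The template markers
# below mimic Llama 2's chat format; real models each use their own
# special tokens.

SAFETY_PROMPT = (
    "You are a helpful, respectful and honest assistant. "
    "Always answer as safely as possible."
)  # hypothetical wording, for illustration only

def build_prompt(user_msg, system_prompt=None):
    """Render a single-turn chat prompt with an optional system prompt."""
    parts = []
    if system_prompt is not None:
        parts.append(f"<<SYS>>\n{system_prompt}\n<</SYS>>")
    parts.append(f"[INST] {user_msg} [/INST]")
    return "\n".join(parts)

# Fine-tuning stage: the safety system prompt is omitted.
train_prompt = build_prompt("Solve: 2 + 2 = ?")

# Inference stage: the safety system prompt is included.
test_prompt = build_prompt("Solve: 2 + 2 = ?", system_prompt=SAFETY_PROMPT)
```

The point of the mismatch is that fine-tuning never updates the model's behavior conditioned on the safety prompt, so the alignment it encodes is better preserved at test time.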
Cite
Text

Lyu et al. "Keeping LLMs Aligned After Fine-Tuning: The Crucial Role of Prompt Templates." ICLR 2024 Workshops: R2-FM, 2024.

Markdown

[Lyu et al. "Keeping LLMs Aligned After Fine-Tuning: The Crucial Role of Prompt Templates." ICLR 2024 Workshops: R2-FM, 2024.](https://mlanthology.org/iclrw/2024/lyu2024iclrw-keeping/)

BibTeX
@inproceedings{lyu2024iclrw-keeping,
title = {{Keeping LLMs Aligned After Fine-Tuning: The Crucial Role of Prompt Templates}},
author = {Lyu, Kaifeng and Zhao, Haoyu and Gu, Xinran and Yu, Dingli and Goyal, Anirudh and Arora, Sanjeev},
booktitle = {ICLR 2024 Workshops: R2-FM},
year = {2024},
url = {https://mlanthology.org/iclrw/2024/lyu2024iclrw-keeping/}
}