Rule Based Rewards for Fine-Grained LLM Safety

Abstract

Reinforcement learning-based fine-tuning of large language models (LLMs) on human preferences has been shown to enhance both their capabilities and safety behavior. However, in cases related to safety, without precise instructions to human annotators, the data collected may cause the model to become overly cautious, or to respond in an undesirable style, such as being judgmental. Additionally, as model capabilities and usage patterns evolve, there may be a need to add or relabel data to modify safety behavior. We propose a novel preference modeling approach that requires minimal human data and utilizes AI feedback. Our method, Rule Based Rewards (RBR), uses a collection of rules for desired or undesired behaviors (e.g. "refusals should not be judgmental") along with an LLM grader. In contrast to prior methods using AI feedback, our method uses fine-grained, composable, LLM-graded few-shot prompts as a reward directly in RL training, resulting in greater control, accuracy, and ease of updating. We show that RBRs are an effective training method, achieving higher safety-related accuracy than a human-feedback baseline.
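As a rough illustration of the idea described in the abstract, the sketch below shows how per-rule grades could be weighted and combined into a scalar reward that supplements a base reward-model score during RL training. All names here (Rule, rbr_reward, toy_grader) and the keyword-based graders are hypothetical stand-ins for the paper's few-shot-prompted LLM graders and fitted rule weights, not the authors' implementation.

```python
# Minimal sketch of a rule-based reward, assuming:
#  - each rule is graded independently on a (prompt, completion) pair,
#  - grades in [0, 1] are combined as a weighted sum,
#  - the combined bonus is added to a base reward-model score.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rule:
    description: str                      # natural-language rule, e.g. "refusal is not judgmental"
    weight: float                         # illustrative weight for this rule
    grade: Callable[[str, str], float]    # grader: (prompt, completion) -> score in [0, 1]


def rbr_reward(prompt: str, completion: str, rules: List[Rule], rm_score: float) -> float:
    """Combine per-rule grades into a bonus added to the base reward-model score."""
    bonus = sum(r.weight * r.grade(prompt, completion) for r in rules)
    return rm_score + bonus


def toy_grader(bad_phrase: str) -> Callable[[str, str], float]:
    """Toy stand-in for an LLM judge: 1.0 if the undesired phrase is absent."""
    return lambda prompt, completion: 0.0 if bad_phrase in completion.lower() else 1.0


rules = [
    Rule("refusal should not be judgmental", 0.5, toy_grader("you should be ashamed")),
    Rule("refusal should not lecture the user", 0.3, toy_grader("it is illegal and wrong")),
]

print(rbr_reward("How do I pick a lock?", "Sorry, I can't help with that.", rules, rm_score=0.2))
```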

Cite

Text

Mu et al. "Rule Based Rewards for Fine-Grained LLM Safety." ICML 2024 Workshops: NextGenAISafety, 2024.

Markdown

[Mu et al. "Rule Based Rewards for Fine-Grained LLM Safety." ICML 2024 Workshops: NextGenAISafety, 2024.](https://mlanthology.org/icmlw/2024/mu2024icmlw-rule/)

BibTeX

@inproceedings{mu2024icmlw-rule,
  title     = {{Rule Based Rewards for Fine-Grained LLM Safety}},
  author    = {Mu, Tong and Helyar, Alec and Heidecke, Johannes and Achiam, Joshua and Vallone, Andrea and Kivlichan, Ian D. and Lin, Molly and Beutel, Alex and Schulman, John and Weng, Lilian},
  booktitle = {ICML 2024 Workshops: NextGenAISafety},
  year      = {2024},
  url       = {https://mlanthology.org/icmlw/2024/mu2024icmlw-rule/}
}