SafetyAnalyst: Interpretable, Transparent, and Steerable LLM Safety Moderation
Abstract
The ideal LLM content moderation system would be both structurally interpretable (so its decisions can be explained to users) and steerable (to reflect a community's values or align to safety preferences). However, current systems fall short on both of these dimensions. To address this gap, we present SafetyAnalyst, a novel LLM safety moderation framework. Given a prompt, SafetyAnalyst creates a structured "harm-benefit tree," which identifies 1) the actions that could be taken if a compliant response were provided, 2) the harmful and beneficial effects of those actions (along with their likelihood, severity, and immediacy), and 3) the stakeholders that would be impacted by those effects. It then aggregates this structured representation into a harmfulness score based on a parameterized set of safety preferences, which can be transparently aligned to particular values. To demonstrate the power of this framework, we develop, test, and release a prototype system for prompt safety classification, SafetyReporter, which includes two LMs specialized in generating harm-benefit trees and an interpretable algorithm that aggregates the harm-benefit trees into safety labels. SafetyReporter is trained on 18.5 million harm-benefit features generated by SOTA LLMs on 19k prompts. On a comprehensive set of benchmarks, we show that SafetyReporter (average F1 = 0.75) outperforms existing LLM safety moderation systems (average F1 < 0.72) on prompt safety classification, while offering the additional benefits of interpretability and steerability.
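To make the abstract's harm-benefit tree and preference-weighted aggregation concrete, the sketch below illustrates one plausible way such a structure could be represented and scored. This is not the authors' released implementation; all class names, fields, and the weighting scheme are hypothetical assumptions for illustration only.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a harm-benefit tree and its aggregation into a
# harmfulness score; names and weighting are illustrative, not SafetyAnalyst's.

@dataclass
class Effect:
    description: str
    harmful: bool          # True = harmful effect, False = beneficial effect
    stakeholder: str       # who is impacted by this effect
    likelihood: float      # 0-1, how likely the effect is
    severity: float        # 0-1, how severe (or valuable) the effect is
    immediacy: float       # 0-1, how soon the effect would occur

@dataclass
class Action:
    description: str       # action enabled by a compliant response
    effects: list[Effect] = field(default_factory=list)

@dataclass
class HarmBenefitTree:
    prompt: str
    actions: list[Action] = field(default_factory=list)

def harmfulness_score(tree: HarmBenefitTree,
                      harm_weight: float = 1.0,
                      benefit_weight: float = 1.0) -> float:
    """Aggregate the tree into a scalar score; positive means net harmful.

    The weights parameterize safety preferences: raising harm_weight makes
    classification more conservative, raising benefit_weight more permissive.
    """
    score = 0.0
    for action in tree.actions:
        for e in action.effects:
            contribution = e.likelihood * e.severity * e.immediacy
            score += (harm_weight if e.harmful else -benefit_weight) * contribution
    return score

# Toy usage: classify a prompt as harmful if the aggregated score is positive.
tree = HarmBenefitTree(
    prompt="How do I pick a lock?",
    actions=[Action(
        description="Break into someone else's property",
        effects=[
            Effect("Property loss", True, "property owner", 0.3, 0.7, 0.8),
            Effect("Regain access to own home", False, "user", 0.6, 0.5, 0.9),
        ],
    )],
)
is_harmful = harmfulness_score(tree, harm_weight=1.5) > 0
```

Steerability in this sketch comes purely from the scalar weights: the same tree can be re-scored under different safety preferences without re-running any LM.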
Cite
Text
Li et al. "SafetyAnalyst: Interpretable, Transparent, and Steerable LLM Safety Moderation." NeurIPS 2024 Workshops: SoLaR, 2024.
BibTeX
@inproceedings{li2024neuripsw-safetyanalyst,
title = {{SafetyAnalyst: Interpretable, Transparent, and Steerable LLM Safety Moderation}},
author = {Li, Jing-Jing and Pyatkin, Valentina and Kleiman-Weiner, Max and Jiang, Liwei and Dziri, Nouha and Collins, Anne and Borg, Jana Schaich and Sap, Maarten and Choi, Yejin and Levine, Sydney},
booktitle = {NeurIPS 2024 Workshops: SoLaR},
year = {2024},
url = {https://mlanthology.org/neuripsw/2024/li2024neuripsw-safetyanalyst/}
}