BELLS: A Framework Towards Future Proof Benchmarks for the Evaluation of LLM Safeguards

Abstract

Currently, there is no widely recognised methodology for the evaluation of input-output safeguards for Large Language Models (LLMs), such as offline evaluation of traces, automated assessments, content moderation, and periodic or real-time monitoring. In this document, we introduce the Benchmarks for the Evaluation of LLM Safeguards (BELLS), a structured collection of tests, organised in three categories, for three main goals: **(1) established failure tests**, based on well-known benchmarks for well-defined failure modes, aiming to compare the performance of current input-output safeguards; **(2) emerging failure tests**, organised to measure generalisation to never-seen-before failure modes and encourage the development of more general safeguards; **(3) next-gen architecture tests**, for more complex scaffolding (such as LLM-agents and multi-agent systems), aiming to foster the development of safeguards for future applications for which no safeguard currently exists. Furthermore, we implement and share the first next-gen architecture test, using the MACHIAVELLI environment, along with an interactive visualisation of the dataset.

Cite

Text

Dorn et al. "BELLS: A Framework Towards Future Proof Benchmarks for the Evaluation of LLM Safeguards." ICML 2024 Workshops: NextGenAISafety, 2024.

Markdown

[Dorn et al. "BELLS: A Framework Towards Future Proof Benchmarks for the Evaluation of LLM Safeguards." ICML 2024 Workshops: NextGenAISafety, 2024.](https://mlanthology.org/icmlw/2024/dorn2024icmlw-bells/)

BibTeX

@inproceedings{dorn2024icmlw-bells,
  title     = {{BELLS: A Framework Towards Future Proof Benchmarks for the Evaluation of LLM Safeguards}},
  author    = {Dorn, Diego and Variengien, Alexandre and Segerie, Charbel-Raphael and Corruble, Vincent},
  booktitle = {ICML 2024 Workshops: NextGenAISafety},
  year      = {2024},
  url       = {https://mlanthology.org/icmlw/2024/dorn2024icmlw-bells/}
}