BELLS: A Framework Towards Future Proof Benchmarks for the Evaluation of LLM Safeguards
Abstract
Currently, there is no widely recognised methodology for evaluating input-output safeguards for Large Language Models (LLMs), which are used for tasks such as offline evaluation of traces, automated assessments, content moderation, and periodic or real-time monitoring. In this document, we introduce the Benchmarks for the Evaluation of LLM Safeguards (BELLS), a structured collection of tests organised in three categories with three main goals: **(1) established failure tests**, based on well-known benchmarks for well-defined failure modes, aiming to compare the performance of current input-output safeguards; **(2) emerging failure tests**, designed to measure generalisation to never-seen-before failure modes and to encourage the development of more general safeguards; **(3) next-gen architecture tests**, targeting more complex scaffolding (such as LLM-agents and multi-agent systems), aiming to foster the development of safeguards for future applications for which no safeguard currently exists. Furthermore, we implement and share the first next-gen architecture test, using the MACHIAVELLI environment, along with an interactive visualisation of the dataset.
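To make the three test categories concrete, the following is a minimal, hypothetical sketch of how a labelled trace and a safeguard evaluation loop could be represented. The class names, fields, and the `evaluate_safeguard` helper are illustrative assumptions for this page, not the published BELLS schema or code.

```python
from dataclasses import dataclass, field
from enum import Enum


class TestCategory(Enum):
    """The three BELLS test categories described in the abstract."""
    ESTABLISHED_FAILURE = "established-failure"      # well-known benchmarks, well-defined failure modes
    EMERGING_FAILURE = "emerging-failure"            # generalisation to never-seen-before failure modes
    NEXT_GEN_ARCHITECTURE = "next-gen-architecture"  # LLM-agents and multi-agent scaffolding


@dataclass
class Trace:
    """A single interaction trace to be scored by an input-output safeguard.

    Field names are illustrative; the actual BELLS trace format may differ.
    """
    category: TestCategory
    messages: list[dict]        # e.g. [{"role": "user", "content": "..."}, ...]
    is_unsafe: bool             # ground-truth label for the failure mode under test
    extra: dict = field(default_factory=dict)  # environment-specific metadata (e.g. MACHIAVELLI annotations)


def evaluate_safeguard(safeguard, traces: list[Trace]) -> float:
    """Fraction of traces where the safeguard's verdict matches the ground truth.

    `safeguard` is assumed to be a callable that takes a list of messages and
    returns True when it flags the trace as unsafe.
    """
    correct = sum(safeguard(t.messages) == t.is_unsafe for t in traces)
    return correct / len(traces)
```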
Cite
Text
Dorn et al. "BELLS: A Framework Towards Future Proof Benchmarks for the Evaluation of LLM Safeguards." ICML 2024 Workshops: NextGenAISafety, 2024.
Markdown
[Dorn et al. "BELLS: A Framework Towards Future Proof Benchmarks for the Evaluation of LLM Safeguards." ICML 2024 Workshops: NextGenAISafety, 2024.](https://mlanthology.org/icmlw/2024/dorn2024icmlw-bells/)
BibTeX
@inproceedings{dorn2024icmlw-bells,
  title = {{BELLS: A Framework Towards Future Proof Benchmarks for the Evaluation of LLM Safeguards}},
  author = {Dorn, Diego and Variengien, Alexandre and Segerie, Charbel-Raphael and Corruble, Vincent},
  booktitle = {ICML 2024 Workshops: NextGenAISafety},
  year = {2024},
  url = {https://mlanthology.org/icmlw/2024/dorn2024icmlw-bells/}
}