Games for AI-Control: Models of Safety Evaluations of AI Deployment Protocols
Abstract
To evaluate the safety and usefulness of deployment protocols for untrusted AIs, \emph{AI Control} uses a red-teaming exercise played between a protocol designer and an adversary. This paper introduces \emph{AI-Control Games}, a formal decision-making model of the red-teaming exercise. First, we demonstrate the formalism's utility by presenting concrete results for a realistic example. Then, we explain our methodology: introducing AI-Control Games, modelling the example, and exploring solution methods.
Cite
Text
Griffin et al. "Games for AI-Control: Models of Safety Evaluations of AI Deployment Protocols." ICML 2024 Workshops: TiFA, 2024.
Markdown
[Griffin et al. "Games for AI-Control: Models of Safety Evaluations of AI Deployment Protocols." ICML 2024 Workshops: TiFA, 2024.](https://mlanthology.org/icmlw/2024/griffin2024icmlw-games/)
BibTeX
@inproceedings{griffin2024icmlw-games,
title = {{Games for AI-Control: Models of Safety Evaluations of AI Deployment Protocols}},
author = {Griffin, Charlie and Shlegeris, Buck and Abate, Alessandro},
booktitle = {ICML 2024 Workshops: TiFA},
year = {2024},
url = {https://mlanthology.org/icmlw/2024/griffin2024icmlw-games/}
}