Games for AI-Control: Models of Safety Evaluations of AI Deployment Protocols

Abstract

To evaluate the safety and usefulness of deployment protocols for untrusted AIs, \emph{AI Control} uses a red-teaming exercise played between a protocol designer and an adversary. This paper introduces \emph{AI-Control Games}, a formal decision-making model of the red-teaming exercise. First, we demonstrate the formalism's utility by presenting concrete results for a realistic example. Then, we explain our methodology: introducing AI-Control Games, modelling the example, and exploring solution methods.

Cite

Text

Griffin et al. "Games for AI-Control: Models of Safety Evaluations of AI Deployment Protocols." ICML 2024 Workshops: TiFA, 2024.

Markdown

[Griffin et al. "Games for AI-Control: Models of Safety Evaluations of AI Deployment Protocols." ICML 2024 Workshops: TiFA, 2024.](https://mlanthology.org/icmlw/2024/griffin2024icmlw-games/)

BibTeX

@inproceedings{griffin2024icmlw-games,
  title     = {{Games for AI-Control: Models of Safety Evaluations of AI Deployment Protocols}},
  author    = {Griffin, Charlie and Shlegeris, Buck and Abate, Alessandro},
  booktitle = {ICML 2024 Workshops: TiFA},
  year      = {2024},
  url       = {https://mlanthology.org/icmlw/2024/griffin2024icmlw-games/}
}