A Perfect Collusion Benchmark: How Can AI Agents Be Prevented from Colluding with Information-Theoretic Undetectability?

Abstract

Secret collusion among advanced AI agents is widely considered a significant risk to AI safety. In this paper, we investigate whether LLM agents can learn to collude undetectably by hiding secret messages in their overt communications. To this end, we implement a variant of Simmons' prisoners' problem using LLM agents and turn it into a stegosystem by leveraging recent advances in perfectly secure steganography. We suggest that the resulting benchmark environment can be used to investigate how easily LLM agents learn to use perfectly secure steganography tools, and how secret collusion between agents can be countered pre-emptively through paraphrasing attacks on their communication channels. Our work yields unprecedented empirical insight into the question of whether advanced AI agents may be able to collude unnoticed.
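To make the threat model concrete, the sketch below (a toy illustration, not the paper's implementation) shows the principle underlying perfectly secure steganography: if the secret message is first encrypted into a uniformly random bitstream (e.g. with a one-time pad), those bits can serve as the randomness that drives sampling from the covertext distribution, so the resulting stegotext is distributed exactly like innocent cover traffic. The three-symbol alphabet and its distribution are illustrative assumptions.

import random

COVER_DIST = {"a": 0.5, "b": 0.25, "c": 0.25}  # toy covertext distribution

def bits_to_unit_interval(bits):
    # Interpret a bitstring as a dyadic number in [0, 1).
    return sum(b * 2.0 ** -(i + 1) for i, b in enumerate(bits))

def encode_symbol(bits):
    # Inverse-transform sample one cover symbol, using message bits as the randomness.
    u, cum = bits_to_unit_interval(bits), 0.0
    for sym, p in COVER_DIST.items():
        cum += p
        if u < cum:
            return sym
    return sym  # floating-point guard

def decode_symbol(sym):
    # Each symbol occupies a dyadic interval, so its bits can be read off directly:
    # "a" -> [0.0, 0.5), "b" -> [0.5, 0.75), "c" -> [0.75, 1.0).
    return {"a": [0], "b": [1, 0], "c": [1, 1]}[sym]

secret = [random.randint(0, 1) for _ in range(16)]  # stands in for an encrypted message
stegotext, i = [], 0
while i < len(secret):
    sym = encode_symbol(secret[i:i + 2])  # at most 2 bits fit into one symbol here
    stegotext.append(sym)
    i += len(decode_symbol(sym))
recovered = [b for s in stegotext for b in decode_symbol(s)][:len(secret)]
assert recovered == secret
print("".join(stegotext))

Because uniformly random bits select each symbol with exactly its covertext probability (here 1/2, 1/4, 1/4), an eavesdropper observing the channel sees statistically perfect cover traffic. This is why the paper studies pre-emptive paraphrasing attacks on the channel itself, rather than attempts to detect the encoding after the fact.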

Cite

Text

Motwani et al. "A Perfect Collusion Benchmark: How Can AI Agents Be Prevented from Colluding with Information-Theoretic Undetectability?" NeurIPS 2023 Workshops: MASEC, 2023.

Markdown

[Motwani et al. "A Perfect Collusion Benchmark: How Can AI Agents Be Prevented from Colluding with Information-Theoretic Undetectability?" NeurIPS 2023 Workshops: MASEC, 2023.](https://mlanthology.org/neuripsw/2023/motwani2023neuripsw-perfect/)

BibTeX

@inproceedings{motwani2023neuripsw-perfect,
  title     = {{A Perfect Collusion Benchmark: How Can AI Agents Be Prevented from Colluding with Information-Theoretic Undetectability?}},
  author    = {Motwani, Sumeet Ramesh and Baranchuk, Mikhail and Hammond, Lewis and de Witt, Christian Schroeder},
  booktitle = {NeurIPS 2023 Workshops: MASEC},
  year      = {2023},
  url       = {https://mlanthology.org/neuripsw/2023/motwani2023neuripsw-perfect/}
}