Secret Collusion Among AI Agents: Multi-Agent Deception via Steganography

Abstract

Recent advancements in generative AI suggest the potential for large-scale interaction between autonomous agents and humans across platforms such as the internet. While such interactions could foster productive cooperation, the ability of AI agents to circumvent security oversight raises critical multi-agent security problems, particularly in the form of unintended information sharing or undesirable coordination. In our work, we establish the subfield of secret collusion, a form of multi-agent deception, in which two or more agents employ steganographic methods to conceal the true nature of their interactions, be it communicative or otherwise, from oversight. We propose a formal threat model for AI agents communicating steganographically and derive rigorous theoretical insights about the capacity and incentives of large language models (LLMs) to perform secret collusion, in addition to the limitations of threat mitigation measures. We complement our findings with empirical evaluations demonstrating rising steganographic capabilities in frontier single- and multi-agent LLM setups and examining potential scenarios where collusion may emerge, revealing limitations in countermeasures such as monitoring, paraphrasing, and parameter optimization. Our work is the first to formalize and investigate secret collusion among frontier foundation models, identifying it as a critical area in AI safety and outlining a comprehensive research agenda to mitigate future risks of collusion between generative AI systems.

Cite

Text

Motwani et al. "Secret Collusion Among AI Agents: Multi-Agent Deception via Steganography." Neural Information Processing Systems, 2024. doi:10.52202/079017-2336

Markdown

[Motwani et al. "Secret Collusion Among AI Agents: Multi-Agent Deception via Steganography." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/motwani2024neurips-secret/) doi:10.52202/079017-2336

BibTeX

@inproceedings{motwani2024neurips-secret,
  title     = {{Secret Collusion Among AI Agents: Multi-Agent Deception via Steganography}},
  author    = {Motwani, Sumeet Ramesh and Baranchuk, Mikhail and Strohmeier, Martin and Bolina, Vijay and Torr, Philip H.S. and Hammond, Lewis and Schroeder de Witt, Christian},
  booktitle = {Neural Information Processing Systems},
  year      = {2024},
  doi       = {10.52202/079017-2336},
  url       = {https://mlanthology.org/neurips/2024/motwani2024neurips-secret/}
}