Emergence of Steganography Between Large Language Models
Abstract
Future AI systems may involve multiple AI agents with independent and potentially adversarial goals interacting with one another. In these settings, there is the risk that agents will learn to collude in order to increase their gains at the expense of other agents, and steganographic techniques are a powerful way to achieve such collusion undetected. Steganography is defined as the practice of concealing information within another message or physical object to communicate with a colluding party while avoiding detection by a third party. In this paper, we use a simplified candidate screening setting with two Large Language Models (LLMs). Here, a cover letter summarizing LLM has access to sensitive information that has historically been correlated with good candidates, but that it is not allowed to communicate to the decision-making LLM. We use two learning algorithms to optimize the LLMs to improve their performance on the candidate screening task -- In-Context Reinforcement Learning (ICRL) and Gradient-Based Reinforcement Learning (GBRL). We find that even though we do not directly prompt the models to do steganography, it emerges because it is instrumental for obtaining reward.
Cite
Text
Mathew et al. "Emergence of Steganography Between Large Language Models." NeurIPS 2024 Workshops: SoLaR, 2024.Markdown
[Mathew et al. "Emergence of Steganography Between Large Language Models." NeurIPS 2024 Workshops: SoLaR, 2024.](https://mlanthology.org/neuripsw/2024/mathew2024neuripsw-emergence/)BibTeX
@inproceedings{mathew2024neuripsw-emergence,
title = {{Emergence of Steganography Between Large Language Models}},
author = {Mathew, Yohan and McCarthy, Robert and Velja, Joan and Matthews, Ollie and Schoots, Nandi and Cope, Dylan},
booktitle = {NeurIPS 2024 Workshops: SoLaR},
year = {2024},
url = {https://mlanthology.org/neuripsw/2024/mathew2024neuripsw-emergence/}
}