Self-Play $q$-Learners Can Provably Collude in the Iterated Prisoner’s Dilemma

Abstract

A growing body of computational studies shows that simple machine learning agents converge to cooperative behaviors in social dilemmas, such as collusive price-setting in oligopoly markets, raising questions about what drives this outcome. In this work, we provide theoretical foundations for this phenomenon in the context of self-play multi-agent Q-learners in the iterated prisoner’s dilemma. We characterize broad conditions under which such agents provably learn the cooperative Pavlov (win-stay, lose-shift) policy rather than the Pareto-dominated “always defect” policy. We validate our theoretical results through additional experiments, demonstrating their robustness across a broader class of deep learning algorithms.
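To make the setting concrete, the sketch below implements self-play tabular Q-learning in the iterated prisoner's dilemma, where the state is the previous joint action and both players share one Q-table. This is an illustrative sketch only, not the authors' exact construction: the payoff values, learning rate, discount factor, and exploration schedule are assumed for demonstration, and whether a given run's greedy policy matches Pavlov depends on these choices.

```python
import numpy as np

# Minimal self-play tabular Q-learning in the iterated prisoner's dilemma.
# Illustrative sketch, not the paper's exact setup: payoff matrix, learning
# rate, discount, and epsilon schedule below are assumed values.

rng = np.random.default_rng(0)

C, D = 0, 1  # actions: cooperate, defect
# Standard PD payoffs (assumed): R=3 mutual cooperation, T=5 temptation,
# S=0 sucker's payoff, P=1 mutual defection.
PAYOFF = {(C, C): (3, 3), (C, D): (0, 5), (D, C): (5, 0), (D, D): (1, 1)}

ALPHA, GAMMA = 0.1, 0.95  # assumed learning rate and discount factor

def state(a_self, a_other):
    # State = previous joint action, seen from one player's perspective.
    return 2 * a_self + a_other

# Self-play: both players act and learn with the same shared Q-table.
Q = np.zeros((4, 2))

s1 = s2 = state(C, C)  # arbitrary initial state
for t in range(200_000):
    eps = max(0.01, 0.99999 ** t)  # assumed decaying epsilon-greedy
    a1 = rng.integers(2) if rng.random() < eps else int(Q[s1].argmax())
    a2 = rng.integers(2) if rng.random() < eps else int(Q[s2].argmax())
    r1, r2 = PAYOFF[(a1, a2)]
    ns1, ns2 = state(a1, a2), state(a2, a1)
    # Standard Q-learning update, applied from each player's viewpoint.
    Q[s1, a1] += ALPHA * (r1 + GAMMA * Q[ns1].max() - Q[s1, a1])
    Q[s2, a2] += ALPHA * (r2 + GAMMA * Q[ns2].max() - Q[s2, a2])
    s1, s2 = ns1, ns2

# Pavlov (win-stay, lose-shift) cooperates after (C,C), defects after
# (C,D) and (D,C), and cooperates again after (D,D).
labels = ["after (C,C)", "after (C,D)", "after (D,C)", "after (D,D)"]
for s, a in enumerate(Q.argmax(axis=1)):
    print(f"{labels[s]}: {'C' if a == C else 'D'}")
```

A greedy policy printing C, D, D, C over the four states corresponds to Pavlov, whereas D, D, D, D corresponds to always defect; the paper's contribution is characterizing conditions under which the former provably emerges.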

Cite

Text

Bertrand et al. "Self-Play $q$-Learners Can Provably Collude in the Iterated Prisoner’s Dilemma." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown

[Bertrand et al. "Self-Play $q$-Learners Can Provably Collude in the Iterated Prisoner’s Dilemma." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/bertrand2025icml-selfplay/)

BibTeX

@inproceedings{bertrand2025icml-selfplay,
  title     = {{Self-Play $q$-Learners Can Provably Collude in the Iterated Prisoner's Dilemma}},
  author    = {Bertrand, Quentin and Duque, Juan Agustin and Calvano, Emilio and Gidel, Gauthier},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
  pages     = {3952--3975},
  volume    = {267},
  url       = {https://mlanthology.org/icml/2025/bertrand2025icml-selfplay/}
}