Classes of Multiagent Q-Learning Dynamics with Epsilon-Greedy Exploration
Abstract
Q-learning in single-agent environments is known to converge in the limit given sufficient exploration. The same algorithm has been applied, with some success, in multiagent environments, where traditional analysis techniques break down. Using established dynamical systems methods, we derive and study an idealization of Q-learning in 2-player 2-action repeated general-sum games. In particular, we address the discontinuous case of epsilon-greedy exploration and use it as a proxy for value-based algorithms to highlight a contrast with existing results in policy search. Analogously to previous results for gradient ascent algorithms, we provide a complete catalog of the convergence behavior of the epsilon-greedy Q-learning algorithm by introducing new subclasses of these games. We identify two subclasses of Prisoner's Dilemma-like games where the application of Q-learning with epsilon-greedy exploration results in higher-than-Nash payoffs for some initial conditions.
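To make the setting concrete, the following is a minimal sketch (not the paper's idealized dynamics) of the algorithm being analyzed: two independent Q-learners with epsilon-greedy exploration playing a repeated 2-player 2-action game. The Prisoner's Dilemma payoff matrix, learning rate, discount factor, and exploration rate below are illustrative assumptions, not values taken from the paper.

```python
import random

# Illustrative Prisoner's Dilemma payoffs (assumed, not from the paper).
# Actions: 0 = cooperate, 1 = defect.
PAYOFFS = {  # (row_action, col_action) -> (row_payoff, col_payoff)
    (0, 0): (3, 3),
    (0, 1): (0, 5),
    (1, 0): (5, 0),
    (1, 1): (1, 1),
}

def epsilon_greedy(q, epsilon, rng):
    """Greedy action with probability 1 - epsilon, else uniform random."""
    if rng.random() < epsilon:
        return rng.randrange(2)
    return max(range(2), key=lambda a: q[a])

def run(steps=20000, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Two stateless Q-learners repeatedly play the 2x2 game."""
    rng = random.Random(seed)
    q1, q2 = [0.0, 0.0], [0.0, 0.0]  # one Q-value per action (single state)
    for _ in range(steps):
        a1 = epsilon_greedy(q1, epsilon, rng)
        a2 = epsilon_greedy(q2, epsilon, rng)
        r1, r2 = PAYOFFS[(a1, a2)]
        # Standard single-state Q-update: Q(a) += alpha * (r + gamma*max_a' Q(a') - Q(a)).
        q1[a1] += alpha * (r1 + gamma * max(q1) - q1[a1])
        q2[a2] += alpha * (r2 + gamma * max(q2) - q2[a2])
    return q1, q2
```

The epsilon-greedy rule is the source of the discontinuity the paper studies: the greedy action switches abruptly when the Q-values cross, unlike smooth policy-gradient or softmax dynamics.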
Cite
Text
Wunder et al. "Classes of Multiagent Q-Learning Dynamics with Epsilon-Greedy Exploration." International Conference on Machine Learning, 2010.

Markdown
[Wunder et al. "Classes of Multiagent Q-Learning Dynamics with Epsilon-Greedy Exploration." International Conference on Machine Learning, 2010.](https://mlanthology.org/icml/2010/wunder2010icml-classes/)

BibTeX
@inproceedings{wunder2010icml-classes,
title = {{Classes of Multiagent Q-Learning Dynamics with Epsilon-Greedy Exploration}},
author = {Wunder, Michael and Littman, Michael L. and Babes, Monica},
booktitle = {International Conference on Machine Learning},
year = {2010},
pages = {1167-1174},
url = {https://mlanthology.org/icml/2010/wunder2010icml-classes/}
}