Regularized SoftMax Deep Multi-Agent Q-Learning

Abstract

Tackling overestimation in $Q$-learning is an important problem that has been extensively studied in single-agent reinforcement learning, but has received comparatively little attention in the multi-agent setting. In this work, we empirically demonstrate that QMIX, a popular $Q$-learning algorithm for cooperative multi-agent reinforcement learning (MARL), suffers from more severe overestimation in practice than previously acknowledged, and that this overestimation is not mitigated by existing approaches. We rectify this with a novel regularization-based update scheme that penalizes large joint action-values that deviate from a baseline, and we demonstrate its effectiveness in stabilizing learning. Furthermore, we propose to employ a softmax operator, which we efficiently approximate in a novel way in the multi-agent setting, to further reduce the potential overestimation bias. Our approach, Regularized Softmax (RES) Deep Multi-Agent $Q$-Learning, is general and can be applied to any $Q$-learning based MARL algorithm. We demonstrate that, when applied to QMIX, RES avoids severe overestimation and significantly improves performance, yielding state-of-the-art results in a variety of cooperative multi-agent tasks, including the challenging StarCraft II micromanagement benchmarks.
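The two ingredients named in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the inverse temperature `beta`, the weight `lam`, and the quadratic form of the penalty are all assumptions made for illustration, and the paper's efficient multi-agent approximation of the softmax operator is not reproduced here. The sketch only shows the standard softmax operator over a small set of action-values (which never exceeds the max, so it gives a less optimistic target) and a hypothetical loss that adds a penalty for deviating from a baseline.

```python
import numpy as np


def softmax_operator(q_values, beta=5.0):
    """Softmax-weighted average of action-values.

    beta is an assumed inverse temperature: beta = 0 gives the mean,
    and large beta approaches the max operator.
    """
    q = np.asarray(q_values, dtype=float)
    # Subtract the max before exponentiating for numerical stability.
    weights = np.exp(beta * (q - q.max()))
    weights /= weights.sum()
    return float(np.dot(weights, q))


def regularized_td_loss(q_joint, td_target, baseline, lam=0.1):
    """Squared TD error plus a penalty on deviation from a baseline.

    The quadratic penalty and the weight lam are hypothetical choices,
    not taken from the paper.
    """
    return (q_joint - td_target) ** 2 + lam * (q_joint - baseline) ** 2


if __name__ == "__main__":
    q = [1.0, 2.0, 3.0]
    print("max target:    ", max(q))
    print("softmax target:", softmax_operator(q, beta=1.0))
    print("loss:          ", regularized_td_loss(2.0, 1.0, 0.0, lam=0.5))
```

Because the softmax operator returns a convex combination of the action-values, its output always lies between their mean and their maximum, which is why it can serve as a less optimistic bootstrap target than the max.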

Cite

Text

Pan et al. "Regularized SoftMax Deep Multi-Agent Q-Learning." Neural Information Processing Systems, 2021.

Markdown

[Pan et al. "Regularized SoftMax Deep Multi-Agent Q-Learning." Neural Information Processing Systems, 2021.](https://mlanthology.org/neurips/2021/pan2021neurips-regularized/)

BibTeX

@inproceedings{pan2021neurips-regularized,
  title     = {{Regularized SoftMax Deep Multi-Agent Q-Learning}},
  author    = {Pan, Ling and Rashid, Tabish and Peng, Bei and Huang, Longbo and Whiteson, Shimon},
  booktitle = {Neural Information Processing Systems},
  year      = {2021},
  url       = {https://mlanthology.org/neurips/2021/pan2021neurips-regularized/}
}