Off-Policy Correction for Multi-Agent Reinforcement Learning

Abstract

Multi-agent reinforcement learning (MARL) provides a framework for problems involving multiple interacting agents. Despite their apparent similarity to the single-agent case, multi-agent problems are often harder to train and to analyze theoretically. In this work, we propose MA-Trace, a new on-policy actor-critic algorithm, which extends V-Trace to the MARL setting. The key advantage of our algorithm is its high scalability in a multi-worker setting. To this end, MA-Trace utilizes importance sampling as an off-policy correction method, which allows distributing the computations with no impact on the quality of training. Furthermore, our algorithm is theoretically grounded: we prove a fixed-point theorem that guarantees convergence. We evaluate the algorithm extensively on the StarCraft Multi-Agent Challenge, a standard benchmark for multi-agent algorithms. MA-Trace achieves high performance on all of its tasks and exceeds state-of-the-art results on some of them.
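
For context, MA-Trace builds on the V-Trace off-policy correction of Espeholt et al. (2018), in which importance ratios between the learner's current policy and the behavior policy used by distributed workers are truncated and folded into the value targets. The sketch below illustrates how such truncated-importance-sampling targets can be computed for a single trajectory; it is not the authors' implementation. The function and parameter names (`vtrace_targets`, `rho_bar`, `c_bar`) are hypothetical, and summing per-agent log-ratios into one joint ratio is an assumption about how the multi-agent extension might be wired up.

```python
import numpy as np

def vtrace_targets(rewards, values, bootstrap_value, log_rhos,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """Illustrative V-Trace-style targets with truncated importance weights.

    rewards, values, log_rhos: arrays of shape [T].
    log_rhos: log(pi(a_t | x_t) / mu(a_t | x_t)); in a multi-agent setting
        this could be the sum of per-agent log-ratios (an assumption here).
    bootstrap_value: value estimate V(x_T) used to bootstrap the last step.
    """
    T = len(rewards)
    ratios = np.exp(log_rhos)
    rhos = np.minimum(ratios, rho_bar)   # truncated importance weights
    cs = np.minimum(ratios, c_bar)       # trace-cutting coefficients

    values_tp1 = np.append(values[1:], bootstrap_value)
    deltas = rhos * (rewards + gamma * values_tp1 - values)

    # Backward recursion: v_s - V(x_s) = delta_s + gamma * c_s * (v_{s+1} - V(x_{s+1}))
    corrections = np.zeros(T)
    acc = 0.0
    for t in reversed(range(T)):
        acc = deltas[t] + gamma * cs[t] * acc
        corrections[t] = acc
    return values + corrections
```

Truncating the ratios (rho_bar, c_bar) bounds the variance introduced by the mismatch between the workers' slightly stale policies and the learner's policy, which is what lets the computation be distributed without degrading training.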

Cite

Text

Zawalski et al. "Off-Policy Correction for Multi-Agent Reinforcement Learning." NeurIPS 2021 Workshops: DeepRL, 2021.

Markdown

[Zawalski et al. "Off-Policy Correction for Multi-Agent Reinforcement Learning." NeurIPS 2021 Workshops: DeepRL, 2021.](https://mlanthology.org/neuripsw/2021/zawalski2021neuripsw-offpolicy/)

BibTeX

@inproceedings{zawalski2021neuripsw-offpolicy,
  title     = {{Off-Policy Correction for Multi-Agent Reinforcement Learning}},
  author    = {Zawalski, Michał and Osiński, Błażej and Michalewski, Henryk and Miłoś, Piotr},
  booktitle = {NeurIPS 2021 Workshops: DeepRL},
  year      = {2021},
  url       = {https://mlanthology.org/neuripsw/2021/zawalski2021neuripsw-offpolicy/}
}