Thompson Sampling for Markov Games with Piecewise Stationary Opponent Policies
Abstract
Reinforcement learning problems with multiple agents pose the challenge of efficiently adapting to nonstationary dynamics arising from other agents’ strategic behavior. Although several algorithms exist for these problems with promising empirical results, regret analysis and efficient use of other-agent models in general-sum games are very limited. We propose an algorithm (TSMG) for general-sum Markov games against agents that switch between several stationary policies, combining change detection with Thompson sampling to learn parametric models of these policies. Under standard assumptions for parametric Markov decision process (MDP) learning, the expected regret of TSMG in the worst case over policy parameters and switch schedules is near-optimal in time and number of switches, up to logarithmic factors. Our experiments on simulated games show that TSMG can outperform standard Thompson sampling and a version of Thompson sampling with a static reset schedule, despite the violation of an assumption that the MDPs induced by the other player are ergodic.
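To illustrate the core idea described in the abstract (Thompson sampling over a parametric opponent model, with a change detector that resets learning when the opponent switches policies), here is a minimal Python sketch. It is not the paper's TSMG algorithm: the Beta-Bernoulli opponent model, the sliding-window change test, and all constants (WINDOW_LEN, THRESHOLD, the switch at t = 500) are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sketch: Thompson sampling on a Beta-Bernoulli model of the
# opponent's action probability, plus a crude change-detection reset.
alpha, beta = 1.0, 1.0          # Beta posterior over P(opponent plays action 1)
window = []                     # recent opponent actions for change detection
WINDOW_LEN, THRESHOLD = 50, 0.3

def act(p_sample):
    """Best response to the sampled opponent model (placeholder payoff rule)."""
    return 1 if p_sample > 0.5 else 0

for t in range(1000):
    # Thompson step: sample an opponent parameter from the posterior, best-respond.
    p_hat = rng.beta(alpha, beta)
    my_action = act(p_hat)

    # Simulated piecewise-stationary opponent: switches its policy at t = 500.
    p_true = 0.8 if t < 500 else 0.2
    opp_action = rng.binomial(1, p_true)

    # Posterior update on the observed opponent action.
    alpha += opp_action
    beta += 1 - opp_action

    # Change detection: if the recent empirical rate drifts far from the
    # posterior mean, reset the posterior so learning restarts after a switch.
    window.append(opp_action)
    if len(window) > WINDOW_LEN:
        window.pop(0)
        if abs(np.mean(window) - alpha / (alpha + beta)) > THRESHOLD:
            alpha, beta = 1.0, 1.0
            window.clear()

In the full Markov-game setting, the sampled parameter would define an induced MDP for which the learner computes an optimal policy, but the sample-plan-detect-reset loop is the same.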
Cite
Text:
DiGiovanni and Tewari. "Thompson Sampling for Markov Games with Piecewise Stationary Opponent Policies." Uncertainty in Artificial Intelligence, 2021.
Markdown:
[DiGiovanni and Tewari. "Thompson Sampling for Markov Games with Piecewise Stationary Opponent Policies." Uncertainty in Artificial Intelligence, 2021.](https://mlanthology.org/uai/2021/digiovanni2021uai-thompson/)
BibTeX:
@inproceedings{digiovanni2021uai-thompson,
title = {{Thompson Sampling for Markov Games with Piecewise Stationary Opponent Policies}},
author = {DiGiovanni, Anthony and Tewari, Ambuj},
booktitle = {Uncertainty in Artificial Intelligence},
year = {2021},
pages = {738-748},
volume = {161},
url = {https://mlanthology.org/uai/2021/digiovanni2021uai-thompson/}
}