Optimistic Posterior Sampling for Reinforcement Learning: Worst-Case Regret Bounds

Abstract

We present an algorithm based on posterior sampling (also known as Thompson sampling) that achieves near-optimal worst-case regret bounds when the underlying Markov Decision Process (MDP) is communicating with a finite, though unknown, diameter. Our main result is a high-probability regret upper bound of $\tilde{O}(D\sqrt{SAT})$ for any communicating MDP with $S$ states, $A$ actions, and diameter $D$, when $T\ge S^5A$. Here, regret compares the total reward achieved by the algorithm to the total expected reward of an optimal infinite-horizon undiscounted average-reward policy over a time horizon $T$. This result improves on the best previously known upper bound of $\tilde{O}(DS\sqrt{AT})$ achieved by any algorithm in this setting, and matches the dependence on $S$ in the established lower bound of $\Omega(\sqrt{DSAT})$ for this problem. Our techniques involve proving some novel results about the anti-concentration of the Dirichlet distribution, which may be of independent interest.
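For orientation, the sketch below shows a generic posterior-sampling (PSRL-style) loop for a finite MDP in Python: maintain Dirichlet posteriors over the transition probabilities of each state-action pair, sample one plausible MDP per epoch, plan in it with relative value iteration for the average-reward criterion, and act greedily. This is only an illustrative sketch of the general technique, not the algorithm analyzed in the paper, which samples multiple transition vectors per state-action pair with an optimistic adjustment and uses a specific epoch schedule; the `env.step` interface and all helper names here are hypothetical.

import numpy as np

def relative_value_iteration(P, R, iters=500, tol=1e-6):
    """Approximate an optimal average-reward policy for the sampled MDP.

    P: (S, A, S) transition probabilities, R: (S, A) expected rewards.
    Returns a policy that is greedy with respect to the approximate bias function h.
    """
    S, A, _ = P.shape
    h = np.zeros(S)
    for _ in range(iters):
        Q = R + P @ h                # Q[s, a] = r(s, a) + sum_s' P(s'|s, a) h(s')
        h_new = Q.max(axis=1)
        h_new -= h_new[0]            # normalize against a reference state so values stay bounded
        if np.max(np.abs(h_new - h)) < tol:
            h = h_new
            break
        h = h_new
    return (R + P @ h).argmax(axis=1)

def doubling_epochs(T):
    """Yield epoch lengths 1, 2, 4, ... until T steps are covered (a simple stand-in schedule)."""
    used, k = 0, 1
    while used < T:
        length = min(k, T - used)
        yield length
        used += length
        k *= 2

def psrl(env, S, A, T, rng=np.random.default_rng(0)):
    """Run T steps of a PSRL-style loop; env.step(s, a) -> (next_state, reward) is assumed."""
    counts = np.ones((S, A, S))      # Dirichlet(1, ..., 1) prior over each transition distribution
    R_hat = np.zeros((S, A))         # running empirical mean of observed rewards
    N = np.zeros((S, A))
    s, total_reward = 0, 0.0
    for epoch_len in doubling_epochs(T):
        # Sample one MDP from the posterior and plan in it for the whole epoch.
        P_sample = np.array([[rng.dirichlet(counts[si, a]) for a in range(A)]
                             for si in range(S)])
        policy = relative_value_iteration(P_sample, R_hat)
        for _ in range(epoch_len):
            a = policy[s]
            s_next, r = env.step(s, a)
            counts[s, a, s_next] += 1          # posterior update for the transition distribution
            N[s, a] += 1
            R_hat[s, a] += (r - R_hat[s, a]) / N[s, a]
            total_reward += r
            s = s_next
    return total_reward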

Cite

Text

Agrawal and Jia. "Optimistic Posterior Sampling for Reinforcement Learning: Worst-Case Regret Bounds." Neural Information Processing Systems, 2017.

Markdown

[Agrawal and Jia. "Optimistic Posterior Sampling for Reinforcement Learning: Worst-Case Regret Bounds." Neural Information Processing Systems, 2017.](https://mlanthology.org/neurips/2017/agrawal2017neurips-optimistic/)

BibTeX

@inproceedings{agrawal2017neurips-optimistic,
  title     = {{Optimistic Posterior Sampling for Reinforcement Learning: Worst-Case Regret Bounds}},
  author    = {Agrawal, Shipra and Jia, Randy},
  booktitle = {Neural Information Processing Systems},
  year      = {2017},
  pages     = {1184--1194},
  url       = {https://mlanthology.org/neurips/2017/agrawal2017neurips-optimistic/}
}