On Thompson Sampling and Asymptotic Optimality

Abstract

We discuss some recent results on Thompson sampling for nonparametric reinforcement learning in countable classes of general stochastic environments. These environments can be non-Markovian, non-ergodic, and partially observable. We show that Thompson sampling learns the environment class in the sense that (1) its value converges asymptotically in mean to the optimal value and (2) under a recoverability assumption, its regret is sublinear. We conclude with a discussion of optimality in reinforcement learning.
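The paper studies Thompson sampling in general (non-Markovian, partially observable) environments, but the core idea — sample an environment from the posterior and act optimally for the sample — is easiest to see in a toy setting. The sketch below, an illustrative Bernoulli-bandit instance and not the paper's algorithm, uses Beta(1,1) priors and resamples the model at every step; the arm means and horizon are arbitrary choices for demonstration.

```python
import random

def thompson_sampling_bernoulli(true_means, horizon, seed=0):
    """Thompson sampling for a Bernoulli bandit with Beta(1,1) priors.

    Illustrative sketch only: the paper treats countable classes of
    general stochastic environments; here each "environment" is just
    an arm's unknown success probability.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    alpha = [1] * n_arms  # Beta posterior parameters: successes + 1
    beta = [1] * n_arms   # failures + 1
    total_reward = 0
    for _ in range(horizon):
        # Sample a plausible mean for each arm from its Beta posterior...
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(n_arms)]
        # ...and act optimally with respect to the sampled model.
        arm = max(range(n_arms), key=lambda i: samples[i])
        reward = 1 if rng.random() < true_means[arm] else 0
        total_reward += reward
        # Update the posterior for the arm that was played.
        alpha[arm] += reward
        beta[arm] += 1 - reward
    return total_reward
```

In the paper's setting the "arms" are whole environments and acting on a sample means following an optimal policy for the sampled environment over an effective horizon, but the posterior-sampling loop has the same shape.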

Cite

Text

Leike et al. "On Thompson Sampling and Asymptotic Optimality." International Joint Conference on Artificial Intelligence, 2017. doi:10.24963/IJCAI.2017/688

Markdown

[Leike et al. "On Thompson Sampling and Asymptotic Optimality." International Joint Conference on Artificial Intelligence, 2017.](https://mlanthology.org/ijcai/2017/leike2017ijcai-thompson/) doi:10.24963/IJCAI.2017/688

BibTeX

@inproceedings{leike2017ijcai-thompson,
  title     = {{On Thompson Sampling and Asymptotic Optimality}},
  author    = {Leike, Jan and Lattimore, Tor and Orseau, Laurent and Hutter, Marcus},
  booktitle = {International Joint Conference on Artificial Intelligence},
  year      = {2017},
  pages     = {4889--4893},
  doi       = {10.24963/IJCAI.2017/688},
  url       = {https://mlanthology.org/ijcai/2017/leike2017ijcai-thompson/}
}