Model-Free Posterior Sampling via Learning Rate Randomization

Abstract

In this paper, we introduce Randomized Q-learning (RandQL), a novel randomized model-free algorithm for regret minimization in episodic Markov Decision Processes (MDPs). To the best of our knowledge, RandQL is the first tractable model-free posterior sampling-based algorithm. We analyze the performance of RandQL in both tabular and non-tabular metric space settings. In tabular MDPs, RandQL achieves a regret bound of order $\widetilde{\mathcal{O}}(\sqrt{H^{5}SAT})$, where $H$ is the planning horizon, $S$ is the number of states, $A$ is the number of actions, and $T$ is the number of episodes. For a metric state-action space, RandQL enjoys a regret bound of order $\widetilde{\mathcal{O}}(H^{5/2} T^{(d_z+1)/(d_z+2)})$, where $d_z$ denotes the zooming dimension. Notably, RandQL achieves optimistic exploration without using bonuses, relying instead on a novel idea of learning rate randomization. Our empirical study shows that RandQL outperforms existing approaches on baseline exploration environments.
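To make the learning-rate-randomization idea concrete, below is a minimal tabular sketch in Python, not the exact RandQL algorithm from the paper (which additionally maintains temporary Q-values with periodic resets). The sketch keeps an ensemble of Q-tables, updates each member with an independent Beta(H, n)-distributed step size in place of an exploration bonus, and acts greedily with respect to the ensemble maximum to induce optimism. The environment interface (reset/step), the ensemble size, and the name randql_sketch are illustrative assumptions, not the paper's specification.

import numpy as np

def randql_sketch(env, S, A, H, T, n_ensemble=10, seed=None):
    """Illustrative tabular Q-learning with randomized learning rates.

    A simplified sketch of learning-rate randomization: each ensemble
    member is updated with an independent Beta(H, n) step size, whose
    mean H / (H + n) matches the usual (H + 1) / (H + n) schedule of
    optimistic Q-learning up to constants. All parameter choices here
    are assumptions made for illustration only.
    """
    rng = np.random.default_rng(seed)
    # Ensemble of stage-dependent Q-tables, optimistically initialized at H.
    q_ens = np.full((n_ensemble, H, S, A), float(H))
    q_pol = np.full((H, S, A), float(H))      # policy Q-values
    counts = np.zeros((H, S, A), dtype=int)   # visit counts n(h, s, a)

    for _ in range(T):
        s = env.reset()                        # assumed interface: returns a state index
        for h in range(H):
            a = int(np.argmax(q_pol[h, s]))    # greedy w.r.t. policy Q-values
            s_next, r, _ = env.step(a)         # assumed interface: (next state, reward, done)
            counts[h, s, a] += 1
            n = counts[h, s, a]
            # Next-state value: 0 at the final stage, else max over actions.
            v_next = 0.0 if h == H - 1 else float(np.max(q_pol[h + 1, s_next]))
            # Randomized learning rates: one independent Beta(H, n) draw
            # per ensemble member, replacing an additive exploration bonus.
            w = rng.beta(H, n, size=n_ensemble)
            q_ens[:, h, s, a] = (1 - w) * q_ens[:, h, s, a] + w * (r + v_next)
            # Taking the max over the ensemble yields optimism with
            # constant probability, mimicking posterior sampling.
            q_pol[h, s, a] = q_ens[:, h, s, a].max()
            s = s_next
    return q_pol

The randomness enters only through the step sizes, so the per-step cost stays that of ordinary model-free Q-learning times the ensemble size, which is what makes this style of posterior sampling tractable without a model.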

Cite

Text

Tiapkin et al. "Model-Free Posterior Sampling via Learning Rate Randomization." Neural Information Processing Systems, 2023.

Markdown

[Tiapkin et al. "Model-Free Posterior Sampling via Learning Rate Randomization." Neural Information Processing Systems, 2023.](https://mlanthology.org/neurips/2023/tiapkin2023neurips-modelfree/)

BibTeX

@inproceedings{tiapkin2023neurips-modelfree,
  title     = {{Model-Free Posterior Sampling via Learning Rate Randomization}},
  author    = {Tiapkin, Daniil and Belomestny, Denis and Calandriello, Daniele and Moulines, Eric and Munos, Remi and Naumov, Alexey and Perrault, Pierre and Valko, Michal and Ménard, Pierre},
  booktitle = {Neural Information Processing Systems},
  year      = {2023},
  url       = {https://mlanthology.org/neurips/2023/tiapkin2023neurips-modelfree/}
}