Variational Bayesian Reinforcement Learning with Regret Bounds

Abstract

We consider the exploration-exploitation trade-off in reinforcement learning and show that an agent endowed with an exponential epistemic-risk-seeking utility function explores efficiently, as measured by regret. The state-action values induced by the exponential utility satisfy a Bellman recursion, so we can use dynamic programming to compute them. We call the resulting algorithm K-learning (for knowledge); the risk-seeking utility ensures that the associated state-action values (K-values) are optimistic for the expected optimal Q-values under the posterior. The exponential utility function induces a Boltzmann exploration policy whose 'temperature' parameter equals the risk-seeking parameter and is carefully controlled to yield a Bayes regret bound of $\tilde O(L^{3/2} \sqrt{S A T})$, where $L$ is the time horizon, $S$ is the number of states, $A$ is the number of actions, and $T$ is the total number of elapsed timesteps. We conclude with a numerical example demonstrating that K-learning is competitive with other state-of-the-art algorithms in practice.
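To illustrate the idea of a K-value backup with a Boltzmann policy, here is a minimal sketch, not the paper's exact algorithm: it assumes a Gaussian posterior over mean rewards and a point estimate of the transition model, and it omits the temperature schedule and concentration terms that drive the regret bound. All names (`k_learning_sketch`, `mu_r`, `sigma_r`, `P_hat`, `tau`) are illustrative and not taken from the paper.

```python
import numpy as np

def k_learning_sketch(mu_r, sigma_r, P_hat, L, tau):
    """Illustrative finite-horizon K-value backup (a sketch, not the paper's algorithm).

    mu_r, sigma_r : (S, A) posterior mean / std of rewards (Gaussian assumption)
    P_hat         : (S, A, S) estimated transition probabilities
    L             : time horizon
    tau           : risk-seeking parameter, also the Boltzmann temperature
    Returns K-values of shape (L, S, A) and a Boltzmann policy of shape (L, S, A).
    """
    S, A = mu_r.shape
    K = np.zeros((L, S, A))
    V_next = np.zeros(S)  # value beyond the horizon is zero
    for l in reversed(range(L)):
        # Exponential (risk-seeking) utility of a Gaussian reward posterior:
        # tau * log E[exp(r / tau)] = mu + sigma^2 / (2 * tau),
        # so the K-values are optimistic for the posterior-mean rewards.
        K[l] = mu_r + sigma_r ** 2 / (2.0 * tau) + P_hat @ V_next
        # Log-sum-exp (soft-max) backup of the K-values at this stage.
        V_next = tau * np.log(np.exp(K[l] / tau).sum(axis=1))
    # Boltzmann exploration policy with temperature equal to tau.
    logits = K / tau
    policy = np.exp(logits - logits.max(axis=2, keepdims=True))
    policy /= policy.sum(axis=2, keepdims=True)
    return K, policy

# Example usage on random posterior statistics (purely illustrative).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, A, L = 5, 3, 4
    mu_r = rng.normal(size=(S, A))
    sigma_r = np.full((S, A), 0.5)
    P_hat = rng.dirichlet(np.ones(S), size=(S, A))
    K, policy = k_learning_sketch(mu_r, sigma_r, P_hat, L, tau=1.0)
    print(policy[0])  # exploration probabilities at the first stage
```

In this sketch the same parameter `tau` appears both in the risk-seeking bonus and as the temperature of the resulting Boltzmann policy, mirroring the coupling described in the abstract; in the paper that parameter is controlled over time to obtain the stated regret bound.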

Cite

Text

O'Donoghue. "Variational Bayesian Reinforcement Learning with Regret Bounds." Neural Information Processing Systems, 2021.

Markdown

[O'Donoghue. "Variational Bayesian Reinforcement Learning with Regret Bounds." Neural Information Processing Systems, 2021.](https://mlanthology.org/neurips/2021/odonoghue2021neurips-variational-a/)

BibTeX

@inproceedings{odonoghue2021neurips-variational-a,
  title     = {{Variational Bayesian Reinforcement Learning with Regret Bounds}},
  author    = {O'Donoghue, Brendan},
  booktitle = {Neural Information Processing Systems},
  year      = {2021},
  url       = {https://mlanthology.org/neurips/2021/odonoghue2021neurips-variational-a/}
}