Variational Bayesian Reinforcement Learning with Regret Bounds
Abstract
We consider the exploration-exploitation trade-off in reinforcement learning and show that an agent endowed with an exponential epistemic-risk-seeking utility function explores efficiently, as measured by regret. The state-action values induced by the exponential utility satisfy a Bellman recursion, so we can use dynamic programming to compute them. We call the resulting algorithm K-learning (for knowledge); the risk-seeking utility ensures that the associated state-action values (K-values) are optimistic for the expected optimal Q-values under the posterior. The exponential utility function induces a Boltzmann exploration policy whose 'temperature' parameter equals the risk-seeking parameter and is carefully controlled to yield a Bayes regret bound of $\tilde O(L^{3/2} \sqrt{S A T})$, where $L$ is the time horizon, $S$ is the number of states, $A$ is the number of actions, and $T$ is the total number of elapsed timesteps. We conclude with a numerical example demonstrating that K-learning is competitive with other state-of-the-art algorithms in practice.
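To make the Bellman recursion and the induced Boltzmann policy concrete, the following is a minimal illustrative sketch (Python/NumPy) of a log-sum-exp backup on a tabular finite-horizon MDP with a fixed temperature tau. It is not the paper's algorithm: the actual K-learning update incorporates posterior-variance terms and a carefully controlled temperature schedule to obtain the stated regret bound, and the posterior-mean rewards mu_r and known transition tensor P used here are simplifying assumptions.

import numpy as np

def soft_bellman_backup(mu_r, P, L, tau):
    """Finite-horizon log-sum-exp (soft) Bellman backup producing K-value-like quantities.

    mu_r : (S, A) array of posterior-mean rewards (stand-in for the paper's posterior terms).
    P    : (S, A, S) array of transition probabilities (assumed known for this sketch).
    L    : horizon length.
    tau  : risk-seeking / Boltzmann temperature parameter (held fixed here).
    """
    S, A = mu_r.shape
    K = np.zeros((L + 1, S, A))  # terminal values K[L] = 0
    for l in range(L - 1, -1, -1):
        # soft value of the next state: tau * log sum_a' exp(K_{l+1}(s', a') / tau)
        V_next = tau * np.log(np.exp(K[l + 1] / tau).sum(axis=1))  # shape (S,)
        # Bellman recursion: expected reward plus expected soft next-state value
        K[l] = mu_r + P @ V_next
    return K

def boltzmann_policy(K_l, tau):
    """Boltzmann exploration policy with temperature equal to the risk-seeking parameter."""
    logits = K_l / tau
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

In this sketch the same tau appears in both the backup and the policy, mirroring the abstract's statement that the exploration temperature equals the risk-seeking parameter; how tau is set over time is what drives the regret analysis in the paper.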
Cite
Text
O'Donoghue. "Variational Bayesian Reinforcement Learning with Regret Bounds." Neural Information Processing Systems, 2021.Markdown
[O'Donoghue. "Variational Bayesian Reinforcement Learning with Regret Bounds." Neural Information Processing Systems, 2021.](https://mlanthology.org/neurips/2021/odonoghue2021neurips-variational-a/)BibTeX
@inproceedings{odonoghue2021neurips-variational-a,
title = {{Variational Bayesian Reinforcement Learning with Regret Bounds}},
author = {O'Donoghue, Brendan},
booktitle = {Neural Information Processing Systems},
year = {2021},
url = {https://mlanthology.org/neurips/2021/odonoghue2021neurips-variational-a/}
}