Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning
Abstract
We present a learning algorithm for undiscounted reinforcement learning. Our interest lies in bounds on the algorithm's online performance after a finite number of steps. In the spirit of methods already applied successfully to the exploration-exploitation tradeoff in multi-armed bandit problems, we use upper confidence bounds to show that our UCRL algorithm achieves online regret that is logarithmic in the number of steps taken, measured with respect to an optimal policy.
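To make the upper-confidence-bound idea concrete, here is a minimal Python sketch of an UCRL-style optimistic planning step. It maintains empirical estimates of rewards and transition probabilities together with confidence radii that shrink roughly as sqrt(log t / n), and runs a crude optimistic value iteration. The confidence constants, the way probability mass is shifted toward the best-valued state, and the function name ucrl_optimistic_values are illustrative assumptions, not the paper's exact construction.

import numpy as np

def ucrl_optimistic_values(counts, reward_sums, trans_counts, t,
                           delta=0.05, n_iter=100):
    """Sketch of an optimistic (UCRL-style) value computation.

    counts[s, a]          -- visits to state-action pair (s, a)
    reward_sums[s, a]     -- total reward observed at (s, a)
    trans_counts[s, a, y] -- observed transitions (s, a) -> y
    t                     -- total number of steps taken so far (t >= 1)
    """
    S, A = counts.shape
    n = np.maximum(counts, 1)                 # avoid division by zero
    r_hat = reward_sums / n                   # empirical mean rewards
    p_hat = trans_counts / n[:, :, None]      # empirical transition probabilities

    # Confidence radii of order sqrt(log t / n); the exact constants in
    # the paper differ -- these are illustrative.
    log_term = np.log(2.0 * S * A * t / delta)
    conf_r = np.sqrt(log_term / (2.0 * n))
    conf_p = np.sqrt(2.0 * log_term / n)

    v = np.zeros(S)
    for _ in range(n_iter):
        # Optimism: shift some transition mass toward the currently
        # best-valued state (a crude stand-in for the paper's extended
        # value iteration over the whole confidence region).
        best = int(np.argmax(v))
        p_opt = p_hat.copy()
        p_opt[:, :, best] = np.minimum(p_opt[:, :, best] + conf_p / 2.0, 1.0)
        p_opt = p_opt / p_opt.sum(axis=2, keepdims=True)
        q = r_hat + conf_r + p_opt @ v        # optimistic Q-values, shape (S, A)
        v = q.max(axis=1)
        v = v - v.min()                       # relative values (undiscounted setting)
    return v, q.argmax(axis=1)                # optimistic values and greedy policy

In the paper itself, the optimistic model is chosen by value iteration over the entire confidence region, and the resulting policy is recomputed at the start of each episode; the sketch above only captures the optimism-in-the-face-of-uncertainty principle underlying the logarithmic regret bound.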
Cite
Text
Auer and Ortner. "Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning." Neural Information Processing Systems, 2006.
Markdown
[Auer and Ortner. "Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning." Neural Information Processing Systems, 2006.](https://mlanthology.org/neurips/2006/auer2006neurips-logarithmic/)
BibTeX
@inproceedings{auer2006neurips-logarithmic,
title = {{Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning}},
author = {Auer, Peter and Ortner, Ronald},
booktitle = {Neural Information Processing Systems},
year = {2006},
pages = {49--56},
url = {https://mlanthology.org/neurips/2006/auer2006neurips-logarithmic/}
}