Reinforcement Learning for Non-Stationary Markov Decision Processes: The Blessing of (More) Optimism
Abstract
We consider un-discounted reinforcement learning (RL) in Markov decision processes (MDPs) under drifting non-stationarity, i.e., both the reward and state transition distributions are allowed to evolve over time, as long as their respective total variations, quantified by suitable metrics, do not exceed certain variation budgets. We first develop the Sliding Window Upper-Confidence bound for Reinforcement Learning with Confidence Widening (SWUCRL2-CW) algorithm, and establish its dynamic regret bound when the variation budgets are known. In addition, we propose the Bandit-over-Reinforcement Learning (BORL) algorithm, which adaptively tunes the sliding window size to achieve the same dynamic regret bound, but in a parameter-free manner, i.e., without knowing the variation budgets. Notably, learning drifting MDPs via conventional optimistic exploration presents a unique challenge absent in existing (non-stationary) bandit learning settings. We overcome this challenge with a novel confidence-widening technique that incorporates additional optimism.
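The confidence-widening idea described in the abstract can be pictured with a small sketch: model estimates are built only from recent observations (a sliding window), and the usual confidence radius around them is enlarged by an extra term so the planner stays optimistic even when the underlying transitions drift. The Python snippet below is a minimal illustration under assumed names and parameters (SlidingWindowModel, window length, widening parameter eta); it is not the paper's SWUCRL2-CW algorithm, which further combines such confidence sets with optimistic planning.

```python
# Minimal sketch (not the authors' implementation) of sliding-window estimation
# with confidence widening: the empirical transition distribution uses only the
# last `window` transitions, and the L1 confidence radius gets an extra `eta`
# of optimism. All names and constants here are illustrative assumptions.
import numpy as np
from collections import deque

class SlidingWindowModel:
    def __init__(self, n_states, n_actions, window=500, delta=0.05, eta=0.1):
        self.nS, self.nA = n_states, n_actions
        self.window = window          # number of most recent transitions kept
        self.delta = delta            # confidence level
        self.eta = eta                # extra widening added to the radius
        self.history = deque(maxlen=window)

    def observe(self, s, a, s_next):
        """Record one observed transition; old data falls out of the window."""
        self.history.append((s, a, s_next))

    def confidence_set(self, s, a):
        """Return (p_hat, radius): the windowed empirical transition
        distribution for (s, a) and a widened L1 confidence radius."""
        counts = np.zeros(self.nS)
        for (si, ai, sn) in self.history:
            if si == s and ai == a:
                counts[sn] += 1
        n = max(1.0, counts.sum())
        p_hat = counts / n
        # Standard Hoeffding/Weissman-style L1 deviation bound ...
        radius = np.sqrt(2 * self.nS * np.log(2 / self.delta) / n)
        # ... plus the additional optimism contributed by widening.
        return p_hat, radius + self.eta
```

In a planner, the widened radius would be used to search over all transition kernels within L1 distance `radius + eta` of `p_hat`, which is where the "more optimism" enters relative to a standard sliding-window UCB construction.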
Cite
Text
Cheung et al. "Reinforcement Learning for Non-Stationary Markov Decision Processes: The Blessing of (More) Optimism." International Conference on Machine Learning, 2020.

Markdown
[Cheung et al. "Reinforcement Learning for Non-Stationary Markov Decision Processes: The Blessing of (More) Optimism." International Conference on Machine Learning, 2020.](https://mlanthology.org/icml/2020/cheung2020icml-reinforcement/)

BibTeX
@inproceedings{cheung2020icml-reinforcement,
title = {{Reinforcement Learning for Non-Stationary Markov Decision Processes: The Blessing of (More) Optimism}},
author = {Cheung, Wang Chi and Simchi-Levi, David and Zhu, Ruihao},
booktitle = {International Conference on Machine Learning},
year = {2020},
pages = {1843--1854},
volume = {119},
url = {https://mlanthology.org/icml/2020/cheung2020icml-reinforcement/}
}