Reinforcement Learning for Non-Stationary Markov Decision Processes: The Blessing of (More) Optimism

Abstract

We consider undiscounted reinforcement learning (RL) in Markov decision processes (MDPs) under drifting non-stationarity, i.e., both the reward and state transition distributions are allowed to evolve over time, as long as their respective total variations, quantified by suitable metrics, do not exceed certain *variation budgets*. We first develop the Sliding Window Upper-Confidence bound for Reinforcement Learning with Confidence Widening (`SWUCRL2-CW`) algorithm, and establish its dynamic regret bound when the variation budgets are known. In addition, we propose the Bandit-over-Reinforcement Learning (`BORL`) algorithm to adaptively tune the sliding window size and achieve the same dynamic regret bound, but in a *parameter-free* manner, i.e., without knowing the variation budgets. Notably, learning drifting MDPs via conventional optimistic exploration presents a unique challenge absent in existing (non-stationary) bandit learning settings. We overcome this challenge with a novel confidence widening technique that incorporates additional optimism.
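
The abstract's key ingredients are (i) estimating transitions from a sliding window of recent data, so stale samples are forgotten under drift, and (ii) enlarging the usual optimistic confidence region by an extra widening term. Below is a minimal illustrative Python sketch of these two pieces, assuming a Weissman-type L1 confidence bound of the kind used in UCRL2-style analyses. The function names, constants, and the parameters `delta` and `eta` are hypothetical choices for illustration, not the paper's exact formulation.

```python
import math
from collections import deque


def widened_l1_radius(n_visits: int, num_states: int, horizon: int,
                      delta: float = 0.05, eta: float = 0.1) -> float:
    """Illustrative UCRL2-style L1 confidence radius for an empirical
    transition distribution, plus an extra widening term `eta`.

    A Weissman-type bound gives a radius on the order of
    sqrt(S * log(.) / n); confidence widening adds further optimism
    on top. Constants here are illustrative, not the paper's.
    """
    n = max(n_visits, 1)
    base = math.sqrt(2 * num_states * math.log(2 * horizon / delta) / n)
    return base + eta  # widening: extra optimism beyond the usual radius


class SlidingWindowCounts:
    """Keep only the most recent `window` transition samples for one
    (state, action) pair, so data from drifted dynamics is discarded."""

    def __init__(self, window: int, num_states: int):
        self.num_states = num_states
        self.samples = deque(maxlen=window)  # indices of observed next states

    def record(self, next_state: int) -> None:
        self.samples.append(next_state)

    def empirical_distribution(self) -> list:
        n = max(len(self.samples), 1)
        counts = [0] * self.num_states
        for s in self.samples:
            counts[s] += 1
        return [c / n for c in counts]
```

An optimistic planner would then search over all transition distributions within `widened_l1_radius(len(counts.samples), ...)` in L1 distance of `counts.empirical_distribution()`; per the abstract, it is the extra widening that keeps optimistic exploration valid under drifting dynamics, a failure mode that does not arise in non-stationary bandits.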

Cite

Text

Cheung et al. "Reinforcement Learning for Non-Stationary Markov Decision Processes: The Blessing of (More) Optimism." International Conference on Machine Learning, 2020.

Markdown

[Cheung et al. "Reinforcement Learning for Non-Stationary Markov Decision Processes: The Blessing of (More) Optimism." International Conference on Machine Learning, 2020.](https://mlanthology.org/icml/2020/cheung2020icml-reinforcement/)

BibTeX

@inproceedings{cheung2020icml-reinforcement,
  title     = {{Reinforcement Learning for Non-Stationary Markov Decision Processes: The Blessing of (More) Optimism}},
  author    = {Cheung, Wang Chi and Simchi-Levi, David and Zhu, Ruihao},
  booktitle = {International Conference on Machine Learning},
  year      = {2020},
  pages     = {1843--1854},
  volume    = {119},
  url       = {https://mlanthology.org/icml/2020/cheung2020icml-reinforcement/}
}