Non-Stationary Reinforcement Learning Without Prior Knowledge: An Optimal Black-Box Approach

Abstract

We propose a black-box reduction that turns a certain reinforcement learning algorithm with optimal regret in a (near-)stationary environment into another algorithm with optimal dynamic regret in a non-stationary environment, importantly without any prior knowledge of the degree of non-stationarity. By plugging different algorithms into our black-box, we provide a list of examples showing that our approach not only recovers recent results for (contextual) multi-armed bandits achieved by very specialized algorithms, but also significantly improves the state of the art for (generalized) linear bandits, episodic MDPs, and infinite-horizon MDPs in various ways. Specifically, in most cases our algorithm achieves the optimal dynamic regret $\widetilde{\mathcal{O}}(\min\{\sqrt{LT}, \Delta^{\frac{1}{3}}T^{\frac{2}{3}}\})$ where $T$ is the number of rounds and $L$ and $\Delta$ are the number and amount of changes of the world respectively, while previous works only obtain suboptimal bounds and/or require the knowledge of $L$ and $\Delta$.
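For intuition about the bound, here is a minimal sketch (not from the paper; the values of $T$, $L$, and $\Delta$ below are purely illustrative) comparing the two terms inside the $\min$ of the dynamic regret guarantee:

```python
import math

def dynamic_regret_bound(T: int, L: int, Delta: float) -> float:
    """Smaller of the two (unscaled, log-factor-free) terms in
    min{ sqrt(L*T), Delta^(1/3) * T^(2/3) }."""
    switching_term = math.sqrt(L * T)                  # sqrt(L*T): few abrupt changes
    variation_term = Delta ** (1 / 3) * T ** (2 / 3)   # Delta^(1/3) T^(2/3): gradual drift
    return min(switching_term, variation_term)

# Illustrative example: T = 1e6 rounds, L = 10 switches, total variation Delta = 0.5.
# Here sqrt(L*T) ~ 3162 while Delta^(1/3) T^(2/3) ~ 7937, so the
# switching-based term gives the tighter bound.
print(dynamic_regret_bound(T=1_000_000, L=10, Delta=0.5))
```

Whichever term is smaller depends on whether the non-stationarity is better described by a few abrupt switches (small $L$) or by slow drift (small $\Delta$); the reduction attains the minimum of the two without knowing either quantity in advance.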

Cite

Text

Wei and Luo. "Non-Stationary Reinforcement Learning Without Prior Knowledge: An Optimal Black-Box Approach." Conference on Learning Theory, 2021.

Markdown

[Wei and Luo. "Non-Stationary Reinforcement Learning Without Prior Knowledge: An Optimal Black-Box Approach." Conference on Learning Theory, 2021.](https://mlanthology.org/colt/2021/wei2021colt-nonstationary/)

BibTeX

@inproceedings{wei2021colt-nonstationary,
  title     = {{Non-Stationary Reinforcement Learning Without Prior Knowledge: An Optimal Black-Box Approach}},
  author    = {Wei, Chen-Yu and Luo, Haipeng},
  booktitle = {Conference on Learning Theory},
  year      = {2021},
  pages     = {4300--4354},
  volume    = {134},
  url       = {https://mlanthology.org/colt/2021/wei2021colt-nonstationary/}
}