Improved Regret for Efficient Online Reinforcement Learning with Linear Function Approximation

Abstract

We study reinforcement learning with linear function approximation and adversarially changing cost functions, a setup that has mostly been considered under simplifying assumptions such as full information feedback or exploratory conditions. We present a computationally efficient policy optimization algorithm for the challenging general setting of unknown dynamics and bandit feedback, featuring a combination of mirror descent and least squares policy evaluation in an auxiliary MDP used to compute exploration bonuses. Our algorithm obtains an $\widetilde O(K^{6/7})$ regret bound, improving significantly over the previous state-of-the-art of $\widetilde O (K^{14/15})$ in this setting. In addition, we present a version of the same algorithm under the assumption that a simulator of the environment is available to the learner (but otherwise no exploratory assumptions are made), and prove that it obtains a state-of-the-art regret of $\widetilde O (K^{2/3})$.
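The mirror-descent component mentioned in the abstract can be illustrated with a generic exponential-weights policy update, the standard form of online mirror descent with KL regularization over the action simplex. This is a minimal sketch for intuition only: the function name, step size, and toy cost estimates below are hypothetical, and it omits the paper's linear function approximation, least squares policy evaluation, and exploration bonuses.

```python
import numpy as np

def mirror_descent_update(policy, q_values, eta):
    """One online mirror descent (exponential weights) step over actions.

    policy:   current action distribution, shape (A,)
    q_values: estimated action costs for the current state, shape (A,)
    eta:      step size

    With KL (negative entropy) regularization, the mirror descent step
    multiplies each action's probability by exp(-eta * cost) and renormalizes.
    """
    logits = np.log(policy) - eta * q_values
    weights = np.exp(logits - logits.max())  # subtract max for stability
    return weights / weights.sum()

# Toy usage: the policy shifts mass away from higher-cost actions.
pi = np.ones(3) / 3                 # start uniform over 3 actions
q = np.array([1.0, 0.0, 0.5])       # hypothetical cost estimates
pi = mirror_descent_update(pi, q, eta=0.5)
```

After one step, the lowest-cost action (index 1) receives the largest probability while the distribution remains on the simplex; in the paper this update is driven by bonus-adjusted value estimates rather than raw costs.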

Cite

Text

Sherman et al. "Improved Regret for Efficient Online Reinforcement Learning with Linear Function Approximation." International Conference on Machine Learning, 2023.

Markdown

[Sherman et al. "Improved Regret for Efficient Online Reinforcement Learning with Linear Function Approximation." International Conference on Machine Learning, 2023.](https://mlanthology.org/icml/2023/sherman2023icml-improved/)

BibTeX

@inproceedings{sherman2023icml-improved,
  title     = {{Improved Regret for Efficient Online Reinforcement Learning with Linear Function Approximation}},
  author    = {Sherman, Uri and Koren, Tomer and Mansour, Yishay},
  booktitle = {International Conference on Machine Learning},
  year      = {2023},
  pages     = {31117--31150},
  volume    = {202},
  url       = {https://mlanthology.org/icml/2023/sherman2023icml-improved/}
}