Learning Adversarial Linear Mixture Markov Decision Processes with Bandit Feedback and Unknown Transition

Abstract

We study reinforcement learning (RL) with linear function approximation, unknown transition, and adversarial losses in the bandit feedback setting. Specifically, the unknown transition probability function is a linear mixture model \citep{AyoubJSWY20,ZhouGS21,HeZG22} with a given feature mapping, and the learner observes only the losses of the experienced state-action pairs rather than the whole loss function. We propose an efficient algorithm, LSUOB-REPS, which achieves an $\widetilde{O}(dS^2\sqrt{K}+\sqrt{HSAK})$ regret guarantee with high probability, where $d$ is the ambient dimension of the feature mapping, $S$ is the size of the state space, $A$ is the size of the action space, $H$ is the episode length, and $K$ is the number of episodes. Furthermore, we prove a lower bound of order $\Omega(dH\sqrt{K}+\sqrt{HSAK})$ for this setting. To the best of our knowledge, we take the first step toward establishing a provably efficient algorithm with a sublinear regret guarantee in this challenging setting, thereby resolving the open problem of \citet{HeZG22}.
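To make the linear mixture model concrete, below is a minimal sketch (not from the paper) of a toy tabular instantiation in which the unknown transition kernel is a convex combination of $d$ known basis kernels, a common special case of the general model where $P(s' \mid s, a) = \langle \phi(s, a, s'), \theta^* \rangle$ for a given feature map $\phi$ and unknown parameter $\theta^*$. All names here (`basis`, `theta_star`, `phi`) and the simplex restriction on `theta_star` are illustrative assumptions, not the paper's construction.

```python
import numpy as np

# Toy linear mixture MDP transition kernel (illustrative only):
# P(s' | s, a) = <phi(s, a, s'), theta_star>, where phi is a known
# feature map built from d known basis kernels and theta_star is the
# unknown mixing parameter the learner must estimate.

rng = np.random.default_rng(0)
S, A, d = 4, 2, 3  # hypothetical state-space size, action-space size, dimension

# d known basis transition kernels; each basis[i, s, a, :] is a
# valid probability distribution over next states, shape (d, S, A, S).
basis = rng.dirichlet(np.ones(S), size=(d, S, A))

# Unknown weights; restricting them to the simplex (an assumption made
# here for simplicity) guarantees the mixture is again a valid kernel.
theta_star = rng.dirichlet(np.ones(d))

def phi(s, a, s_next):
    """Given feature map: vector of the d basis kernel values at (s, a, s')."""
    return basis[:, s, a, s_next]

def transition_prob(s, a, s_next):
    """P(s' | s, a) = <phi(s, a, s'), theta_star>."""
    return phi(s, a, s_next) @ theta_star

# Sanity check: each conditional distribution sums to one.
probs = [transition_prob(0, 1, s_next) for s_next in range(S)]
assert np.isclose(sum(probs), 1.0)
```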

Cite

Text

Zhao et al. "Learning Adversarial Linear Mixture Markov Decision Processes with Bandit Feedback and Unknown Transition." International Conference on Learning Representations, 2023.

Markdown

[Zhao et al. "Learning Adversarial Linear Mixture Markov Decision Processes with Bandit Feedback and Unknown Transition." International Conference on Learning Representations, 2023.](https://mlanthology.org/iclr/2023/zhao2023iclr-learning/)

BibTeX

@inproceedings{zhao2023iclr-learning,
  title     = {{Learning Adversarial Linear Mixture Markov Decision Processes with Bandit Feedback and Unknown Transition}},
  author    = {Zhao, Canzhe and Yang, Ruofeng and Wang, Baoxiang and Li, Shuai},
  booktitle = {International Conference on Learning Representations},
  year      = {2023},
  url       = {https://mlanthology.org/iclr/2023/zhao2023iclr-learning/}
}