Reward-Mixing MDPs with Few Latent Contexts Are Learnable
Abstract
We consider episodic reinforcement learning in reward-mixing Markov decision processes (RMMDPs): at the beginning of every episode, nature randomly picks a latent reward model among $M$ candidates, and an agent interacts with the MDP throughout the episode for $H$ time steps. Our goal is to learn a policy that nearly maximizes the cumulative reward over the $H$ time steps in such a model. Prior work established an upper bound for RMMDPs with $M=2$. In this work, we resolve several open questions for the general RMMDP setting. We consider an arbitrary $M\ge2$ and provide a sample-efficient algorithm, $EM^2$, that outputs an $\epsilon$-optimal policy using $O \left(\epsilon^{-2} \cdot S^d A^d \cdot \text{poly}(H, Z)^d \right)$ episodes, where $S$ and $A$ are the numbers of states and actions respectively, $H$ is the time horizon, $Z$ is the support size of the reward distributions, and $d=O(\min(M,H))$. We also provide a $(SA)^{\Omega(\sqrt{M})} / \epsilon^{2}$ lower bound, indicating that a sample complexity super-polynomial in $M$ is necessary.
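The interaction protocol described in the abstract can be illustrated with a minimal simulation sketch. This is not the $EM^2$ algorithm and is not taken from the paper; the function name `sample_episode`, the nested-list data layout, and the history-dependent `policy` signature are all assumptions made for illustration. It only shows that one latent reward model is drawn per episode, stays hidden from the agent, and governs every reward for the $H$ steps, while the transition dynamics are shared across contexts.

```python
import random


def sample_episode(transitions, reward_models, mixing_weights, policy,
                   horizon, initial_state=0, rng=None):
    """Roll out one episode of a reward-mixing MDP (illustrative layout).

    transitions[s][a]      -> list of next-state probabilities (shared by all contexts)
    reward_models[m][s][a] -> list of (reward_value, probability) pairs for context m
    mixing_weights[m]      -> probability that nature picks context m
    policy(state, t, history) -> action index; history holds past (s, a, r) triples
    """
    rng = rng or random.Random()
    # Nature draws the latent context once per episode; the agent never sees it.
    context = rng.choices(range(len(reward_models)), weights=mixing_weights)[0]
    state = initial_state
    trajectory = []
    for t in range(horizon):
        action = policy(state, t, trajectory)
        # The reward is drawn from the model selected by the hidden context.
        values, probs = zip(*reward_models[context][state][action])
        reward = rng.choices(values, weights=probs)[0]
        trajectory.append((state, action, reward))
        # The next state does not depend on the context.
        next_probs = transitions[state][action]
        state = rng.choices(range(len(next_probs)), weights=next_probs)[0]
    # The context is returned for inspection only; a learner would not observe it.
    return trajectory, context
```

The policy takes the within-episode history because, with the context hidden, past rewards carry information about which reward model is active.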
Cite
Text
Kwon et al. "Reward-Mixing MDPs with Few Latent Contexts Are Learnable." International Conference on Machine Learning, 2023.
Markdown
[Kwon et al. "Reward-Mixing MDPs with Few Latent Contexts Are Learnable." International Conference on Machine Learning, 2023.](https://mlanthology.org/icml/2023/kwon2023icml-rewardmixing/)
BibTeX
@inproceedings{kwon2023icml-rewardmixing,
title = {{Reward-Mixing MDPs with Few Latent Contexts Are Learnable}},
author = {Kwon, Jeongyeol and Efroni, Yonathan and Caramanis, Constantine and Mannor, Shie},
booktitle = {International Conference on Machine Learning},
year = {2023},
pages = {18057-18082},
volume = {202},
url = {https://mlanthology.org/icml/2023/kwon2023icml-rewardmixing/}
}