Increasingly Cautious Optimism for Practical PAC-MDP Exploration

Abstract

Exploration strategies are an essential part of learning agents in model-based Reinforcement Learning. R-MAX and V-MAX are PAC-MDP strategies proved to have polynomial sample complexity; yet, their exploration behavior tends to be overly cautious in practice. We propose the principle of Increasingly Cautious Optimism (ICO) to automatically cut off unnecessarily cautious exploration, and apply ICO to R-MAX and V-MAX, yielding two new strategies, namely Increasingly Cautious R-MAX (ICR) and Increasingly Cautious V-MAX (ICV). We prove that both ICR and ICV are PAC-MDP, and show that their improvement is guaranteed by a tighter sample complexity upper bound. We then demonstrate their significantly improved performance through empirical results.
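For background, the "optimism under uncertainty" mechanism that R-MAX builds on (and which ICO makes increasingly cautious) can be sketched as follows. This is an illustrative toy implementation, not code from the paper: a state-action pair visited fewer than `m` times is treated as unknown and assigned the maximal value `R_max / (1 - gamma)`, while known pairs use the empirical model. All names and the tabular setup are assumptions for illustration.

```python
def rmax_q_values(n_states, n_actions, counts, model, rewards,
                  m=5, r_max=1.0, gamma=0.9, iters=200):
    """Toy R-MAX-style value iteration (illustrative, not the paper's code).

    counts[s][a]  -- visit count of state-action pair (s, a)
    model[s][a]   -- empirical transition dict {next_state: probability}
    rewards[s][a] -- empirical mean reward
    """
    v_max = r_max / (1.0 - gamma)  # value of the fictitious optimistic state
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(iters):
        v = [max(row) for row in q]
        for s in range(n_states):
            for a in range(n_actions):
                if counts[s][a] < m:
                    # Unknown pair: be maximally optimistic to drive exploration.
                    q[s][a] = v_max
                else:
                    # Known pair: back up through the empirical model.
                    q[s][a] = rewards[s][a] + gamma * sum(
                        p * v[s2] for s2, p in model[s][a].items())
    return q
```

In this sketch, every under-visited pair looks as attractive as possible, so a greedy policy is pulled toward unexplored regions; the paper's ICO principle addresses the practical downside that this uniform optimism stays overly cautious for longer than necessary.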

Cite

Text

Zhang et al. "Increasingly Cautious Optimism for Practical PAC-MDP Exploration." International Joint Conference on Artificial Intelligence, 2015.

Markdown

[Zhang et al. "Increasingly Cautious Optimism for Practical PAC-MDP Exploration." International Joint Conference on Artificial Intelligence, 2015.](https://mlanthology.org/ijcai/2015/zhang2015ijcai-increasingly/)

BibTeX

@inproceedings{zhang2015ijcai-increasingly,
  title     = {{Increasingly Cautious Optimism for Practical PAC-MDP Exploration}},
  author    = {Zhang, Liangpeng and Tang, Ke and Yao, Xin},
  booktitle = {International Joint Conference on Artificial Intelligence},
  year      = {2015},
  pages     = {4033--4040},
  url       = {https://mlanthology.org/ijcai/2015/zhang2015ijcai-increasingly/}
}