Improving Policy Gradient by Exploring Under-Appreciated Rewards

Abstract

This paper presents a novel form of policy gradient for model-free reinforcement learning (RL) with improved exploration properties. Current policy-based methods use entropy regularization to encourage undirected exploration of the reward landscape, which is ineffective in high-dimensional spaces with sparse rewards. We propose a more directed exploration strategy that promotes exploration of under-appreciated reward regions. An action sequence is considered under-appreciated if its log-probability under the current policy under-estimates its resulting reward. The proposed exploration strategy is easy to implement, requiring only small modifications to a standard REINFORCE implementation. We evaluate the approach on a set of algorithmic tasks that have long challenged RL methods. Our approach reduces hyper-parameter sensitivity and demonstrates significant improvements over baseline methods. Our algorithm successfully solves a benchmark multi-digit addition task and generalizes to long sequences. This is, to our knowledge, the first time that a pure RL method has solved addition using only reward feedback.
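
As a concrete illustration of the idea, the sketch below augments a plain REINFORCE update with an extra term that up-weights sampled action sequences whose log-probability under-estimates their reward. This is only a minimal NumPy sketch, not the authors' implementation: the specific weight form w_i ∝ exp(r(a_i)/τ − log π_θ(a_i)) (self-normalized over a batch of samples), the toy sequence-matching task with sparse reward, and all hyper-parameter values are assumptions made for illustration.

# Toy sketch of REINFORCE plus an under-appreciated-reward exploration term.
# Not the authors' code; the weight form and the toy task are assumptions.
import numpy as np

rng = np.random.default_rng(0)

SEQ_LEN, N_ACTIONS = 5, 4            # short action sequences over a small action space
TARGET = np.array([1, 3, 0, 2, 1])   # hypothetical task: reward 1 only for matching this
TAU = 0.1                            # temperature on the exploration term (assumed value)
LR, K, STEPS = 0.5, 32, 2000         # learning rate, samples per update, number of updates

logits = np.zeros((SEQ_LEN, N_ACTIONS))   # factorized per-step categorical policy


def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def sample_and_score(probs, k):
    """Sample k action sequences; return actions, total log-probs, and sparse rewards."""
    actions = np.stack([[rng.choice(N_ACTIONS, p=probs[t]) for t in range(SEQ_LEN)]
                        for _ in range(k)])
    logp = np.log(probs[np.arange(SEQ_LEN), actions]).sum(axis=1)    # shape (k,)
    rewards = (actions == TARGET).all(axis=1).astype(float)          # sparse 0/1 reward
    return actions, logp, rewards


for _ in range(STEPS):
    probs = softmax(logits)
    actions, logp, rewards = sample_and_score(probs, K)

    # Standard REINFORCE term: centred reward as a simple baseline-subtracted advantage.
    adv = rewards - rewards.mean()

    # Under-appreciated-reward term: sequences whose log-probability under-estimates
    # their (temperature-scaled) reward get larger self-normalized importance weights.
    scores = rewards / TAU - logp
    w = np.exp(scores - scores.max())
    w /= w.sum()

    # Mix the two terms into one per-sample coefficient on grad log pi(a_i).
    coef = adv / K + TAU * w

    # Score-function gradient of sum_i coef_i * log pi(a_i) w.r.t. the logits.
    grad = np.zeros_like(logits)
    for i in range(K):
        for t in range(SEQ_LEN):
            one_hot = np.eye(N_ACTIONS)[actions[i, t]]
            grad[t] += coef[i] * (one_hot - probs[t])
    logits += LR * grad

print("greedy sequence after training:", softmax(logits).argmax(axis=1), "target:", TARGET)

Compared with an entropy bonus, the extra weights here are directed: they concentrate probability mass on specific sampled sequences that look under-valued by the current policy rather than encouraging uniform randomness.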

Cite

Text

Nachum et al. "Improving Policy Gradient by Exploring Under-Appreciated Rewards." International Conference on Learning Representations, 2017.

Markdown

[Nachum et al. "Improving Policy Gradient by Exploring Under-Appreciated Rewards." International Conference on Learning Representations, 2017.](https://mlanthology.org/iclr/2017/nachum2017iclr-improving/)

BibTeX

@inproceedings{nachum2017iclr-improving,
  title     = {{Improving Policy Gradient by Exploring Under-Appreciated Rewards}},
  author    = {Nachum, Ofir and Norouzi, Mohammad and Schuurmans, Dale},
  booktitle = {International Conference on Learning Representations},
  year      = {2017},
  url       = {https://mlanthology.org/iclr/2017/nachum2017iclr-improving/}
}