The Optimal Reward Baseline for Gradient-Based Reinforcement Learning

Abstract

There exist a number of reinforcement learning algorithms which learn by climbing the gradient of expected reward. Their long-run convergence has been proved, even in partially observable environments with non-deterministic actions, and without the need for a system model. However, the variance of the gradient estimator has been found to be a significant practical problem. Recent approaches have discounted future rewards, introducing a bias-variance trade-off into the gradient estimate. We incorporate a reward baseline into the learning system, and show that it affects variance without introducing further bias. In particular, as we approach the zero-bias, high-variance parameterization, the optimal (or variance-minimizing) constant reward baseline is equal to the long-term average expected reward. Modified policy-gradient algorithms are presented, and a number of experiments demonstrate their improvement over previous work.
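
As a rough illustration of the idea described in the abstract (not a reproduction of the authors' exact algorithms), the sketch below performs an online policy-gradient update with a discounted eligibility trace, subtracting a baseline from each observed reward. The baseline is tracked as a running average of rewards, reflecting the abstract's claim that the long-term average expected reward is the variance-minimizing constant baseline in the zero-bias limit. The linear-softmax policy, step sizes, and function names are all hypothetical.

```python
import numpy as np

def softmax_policy(theta, features):
    """Action probabilities under a linear-softmax policy (one parameter column per action)."""
    logits = features @ theta
    logits -= logits.max()                  # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def grad_log_pi(theta, features, action):
    """Gradient of log pi(action | state) for the linear-softmax parameterization."""
    probs = softmax_policy(theta, features)
    g = np.outer(features, -probs)
    g[:, action] += features
    return g

def baseline_policy_gradient_step(theta, z, baseline, features, action, reward,
                                  beta=0.9, lr=0.01, baseline_lr=0.05):
    """
    One online policy-gradient update with a discounted eligibility trace
    and a constant reward baseline subtracted from the immediate reward.
    Subtracting the baseline leaves the expectation of the update unchanged
    but can reduce its variance (illustrative sketch only).
    """
    z = beta * z + grad_log_pi(theta, features, action)       # eligibility trace
    theta = theta + lr * (reward - baseline) * z              # baseline-shifted update
    baseline = baseline + baseline_lr * (reward - baseline)   # running average reward
    return theta, z, baseline
```

In this sketch, `z` would be initialized to `np.zeros_like(theta)` and `baseline` to zero at the start of learning; the trace discount `beta` plays the role of the bias-variance trade-off mentioned in the abstract, with the baseline's variance-reduction argument applying as `beta` approaches one.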

Cite

Text

Weaver and Tao. "The Optimal Reward Baseline for Gradient-Based Reinforcement Learning." Conference on Uncertainty in Artificial Intelligence, 2001.

Markdown

[Weaver and Tao. "The Optimal Reward Baseline for Gradient-Based Reinforcement Learning." Conference on Uncertainty in Artificial Intelligence, 2001.](https://mlanthology.org/uai/2001/weaver2001uai-optimal/)

BibTeX

@inproceedings{weaver2001uai-optimal,
  title     = {{The Optimal Reward Baseline for Gradient-Based Reinforcement Learning}},
  author    = {Weaver, Lex and Tao, Nigel},
  booktitle = {Conference on Uncertainty in Artificial Intelligence},
  year      = {2001},
  pages     = {538-545},
  url       = {https://mlanthology.org/uai/2001/weaver2001uai-optimal/}
}