The Optimal Reward Baseline for Gradient-Based Reinforcement Learning
Abstract
There exist a number of reinforcement learning algorithms which learn by climbing the gradient of expected reward. Their long-run convergence has been proved, even in partially observable environments with non-deterministic actions, and without the need for a system model. However, the variance of the gradient estimator has been found to be a significant practical problem. Recent approaches have discounted future rewards, introducing a bias-variance trade-off into the gradient estimate. We incorporate a reward baseline into the learning system, and show that it affects variance without introducing further bias. In particular, as we approach the zero-bias, high-variance parameterization, the optimal (or variance-minimizing) constant reward baseline is equal to the long-term average expected reward. Modified policy-gradient algorithms are presented, and a number of experiments demonstrate their improvement over previous work.
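To make the role of the baseline concrete, the sketch below shows a generic REINFORCE-style policy-gradient update in which a constant baseline is subtracted from each return. This is not the paper's modified algorithm; the linear-softmax parameterization and the names `softmax_policy` and `policy_gradient_step` are illustrative assumptions. Setting `baseline` to a running estimate of the long-run average reward corresponds to the variance-minimizing constant baseline discussed in the abstract.

```python
import numpy as np

def softmax_policy(theta, phi):
    """Action probabilities for a linear-softmax policy (illustrative)."""
    logits = phi @ theta              # theta has one column of parameters per action
    logits -= logits.max()            # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def policy_gradient_step(theta, episode, baseline, lr=0.01):
    """One REINFORCE-style update using (return - baseline) instead of the raw return.

    `episode` is a list of (state_features, action, reward) tuples; `baseline` is a
    constant, e.g. a running estimate of the long-run average reward.
    """
    grad = np.zeros_like(theta)
    # Undiscounted return-to-go from each time step to the end of the episode.
    returns = np.cumsum([r for _, _, r in episode][::-1])[::-1]
    for (phi, a, _), G in zip(episode, returns):
        probs = softmax_policy(theta, phi)
        # grad of log pi(a|s) for a linear-softmax policy: phi * (1[a'=a] - pi(a'|s))
        grad_log_pi = -np.outer(phi, probs)
        grad_log_pi[:, a] += phi
        # Subtracting the baseline changes the variance of the estimate, not its mean.
        grad += grad_log_pi * (G - baseline)
    return theta + lr * grad / len(episode)
```

In practice the baseline could be maintained as an exponential moving average of observed rewards between updates; because the expectation of `grad_log_pi` is zero under the policy, any constant shift leaves the gradient estimate unbiased while altering its variance.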
Cite
Text
Weaver and Tao. "The Optimal Reward Baseline for Gradient-Based Reinforcement Learning." Conference on Uncertainty in Artificial Intelligence, 2001.
Markdown
[Weaver and Tao. "The Optimal Reward Baseline for Gradient-Based Reinforcement Learning." Conference on Uncertainty in Artificial Intelligence, 2001.](https://mlanthology.org/uai/2001/weaver2001uai-optimal/)
BibTeX
@inproceedings{weaver2001uai-optimal,
title = {{The Optimal Reward Baseline for Gradient-Based Reinforcement Learning}},
author = {Weaver, Lex and Tao, Nigel},
booktitle = {Conference on Uncertainty in Artificial Intelligence},
year = {2001},
pages = {538-545},
url = {https://mlanthology.org/uai/2001/weaver2001uai-optimal/}
}