Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning

Abstract

We consider the use of two additive control variate methods to reduce the variance of performance gradient estimates in reinforcement learning problems. The first approach we consider is the baseline method, in which a function of the current state is added to the discounted value estimate. We relate the performance of these methods, which use sample paths, to the variance of estimates based on iid data. We derive the baseline function that minimizes this variance, and we show that the variance for any baseline is the sum of the optimal variance and a weighted squared distance to the optimal baseline. We show that the widely used average discounted value baseline (where the reward is replaced by the difference between the reward and its expectation) is suboptimal. The second approach we consider is the actor-critic method, which uses an approximate value function. We give bounds on the expected squared error of its estimates. We show that minimizing distance to the true value function is suboptimal in general; we provide an example for which the true value function gives an estimate with positive variance, but the optimal value function gives an unbiased estimate with zero variance. Our bounds suggest algorithms to estimate the gradient of the performance of parameterized baseline or value functions. We present preliminary experiments that illustrate the performance improvements on a simple control problem.
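To make the abstract's claims concrete, here is a rough sketch in standard policy-gradient notation; the symbols below (\pi_\theta, J_t, b, w, \sigma^2) are our own shorthand and are not taken verbatim from the paper. A sample-path baseline estimator of the performance gradient has the form

\Delta_T = \frac{1}{T} \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(u_t \mid x_t) \, \bigl( J_t - b(x_t) \bigr),

where J_t estimates the discounted future reward from time t and b is the baseline function of the state. The variance decomposition stated in the abstract then reads

\sigma^2(b) = \sigma^2(b^*) + \mathbb{E}\!\left[ w(X) \, \bigl( b(X) - b^*(X) \bigr)^2 \right],

where b^* is the variance-minimizing baseline and w is a nonnegative state-dependent weight. In the analogous iid setting, a standard form of the optimal baseline is

b^*(x) = \frac{\mathbb{E}\bigl[ \|\nabla_\theta \log \pi_\theta(U \mid x)\|^2 \, J \mid x \bigr]}{\mathbb{E}\bigl[ \|\nabla_\theta \log \pi_\theta(U \mid x)\|^2 \mid x \bigr]}.

Because this differs from the expected discounted value \mathbb{E}[J \mid x] whenever the weight is non-constant, the decomposition suggests why the average discounted value baseline mentioned in the abstract is suboptimal in general.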

Cite

Text

Greensmith et al. "Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning." Neural Information Processing Systems, 2001.

Markdown

[Greensmith et al. "Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning." Neural Information Processing Systems, 2001.](https://mlanthology.org/neurips/2001/greensmith2001neurips-variance/)

BibTeX

@inproceedings{greensmith2001neurips-variance,
  title     = {{Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning}},
  author    = {Greensmith, Evan and Bartlett, Peter L. and Baxter, Jonathan},
  booktitle = {Neural Information Processing Systems},
  year      = {2001},
  pages     = {1507--1514},
  url       = {https://mlanthology.org/neurips/2001/greensmith2001neurips-variance/}
}