Estimation and Approximation Bounds for Gradient-Based Reinforcement Learning

Abstract

We model reinforcement learning as the problem of learning to control a partially observable Markov decision process (POMDP), and focus on gradient ascent approaches to this problem. In earlier work (2001, J. Artificial Intelligence Res. 15) we introduced GPOMDP, an algorithm for estimating the performance gradient of a POMDP from a single sample path, and we proved that its estimates converge almost surely to an approximation of the true gradient. In this paper, we provide a convergence rate for the estimates produced by GPOMDP and give an improved bound on their approximation error. Both bounds are expressed in terms of mixing times of the POMDP.
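
For context, the GPOMDP estimator summarized above maintains an eligibility trace of policy log-likelihood gradients and averages reward-weighted traces along a single sample path. The Python sketch below illustrates the update; the environment and policy interfaces (env.reset, env.step, policy, grad_log_policy) are hypothetical placeholders, not part of the paper.

import numpy as np

def gpomdp(env, policy, grad_log_policy, theta, beta=0.95, T=100_000):
    # Single-sample-path GPOMDP gradient estimate (illustrative sketch).
    # Assumed interfaces (hypothetical):
    #   env.reset() -> observation
    #   env.step(action) -> (observation, reward)
    #   policy(theta, obs) -> action sampled from mu(. | theta, obs)
    #   grad_log_policy(theta, obs, action) -> grad_theta log mu(action | theta, obs)
    z = np.zeros_like(theta)      # eligibility trace z_t
    delta = np.zeros_like(theta)  # running gradient estimate Delta_t
    obs = env.reset()
    for t in range(T):
        action = policy(theta, obs)
        z = beta * z + grad_log_policy(theta, obs, action)
        obs, reward = env.step(action)
        # Running average: Delta_{t+1} = Delta_t + (r_{t+1} z_{t+1} - Delta_t) / (t+1)
        delta += (reward * z - delta) / (t + 1)
    return delta  # converges almost surely to the biased gradient estimate

The parameter beta in [0, 1) trades off the two quantities the paper bounds: larger beta shrinks the approximation error of the biased gradient but slows the convergence of the estimates, with both effects governed by the mixing time of the POMDP.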

Cite

Text

Bartlett and Baxter. "Estimation and Approximation Bounds for Gradient-Based Reinforcement Learning." Annual Conference on Computational Learning Theory, 2000. doi:10.1006/jcss.2001.1793

Markdown

[Bartlett and Baxter. "Estimation and Approximation Bounds for Gradient-Based Reinforcement Learning." Annual Conference on Computational Learning Theory, 2000.](https://mlanthology.org/colt/2000/bartlett2000colt-estimation/) doi:10.1006/jcss.2001.1793

BibTeX

@inproceedings{bartlett2000colt-estimation,
  title     = {{Estimation and Approximation Bounds for Gradient-Based Reinforcement Learning}},
  author    = {Bartlett, Peter L. and Baxter, Jonathan},
  booktitle = {Annual Conference on Computational Learning Theory},
  year      = {2000},
  pages     = {133--141},
  doi       = {10.1006/jcss.2001.1793},
  url       = {https://mlanthology.org/colt/2000/bartlett2000colt-estimation/}
}