Learning Near-Optimal Policies with Bellman-Residual Minimization Based Fitted Policy Iteration and a Single Sample Path
Abstract
We consider batch reinforcement learning problems in continuous-space, expected total discounted-reward Markovian Decision Problems. As opposed to previous theoretical work, we consider the case when the training data consists of a single sample path (trajectory) of some behaviour policy. In particular, we do not assume access to a generative model of the environment. The algorithm studied is policy iteration, where in successive iterations the Q-functions of the intermediate policies are obtained by minimizing a novel Bellman-residual type error. PAC-style polynomial bounds are derived on the number of samples needed to guarantee near-optimal performance, where the bound depends on the mixing rate of the trajectory, the smoothness properties of the underlying Markovian Decision Problem, and the approximation power and capacity of the function set used.
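To fix ideas, the following is a minimal Python sketch of the algorithmic template the abstract describes: fitted policy iteration where each iteration re-fits the Q-function of the greedy policy induced by the previous iterate, using transitions from a single trajectory. Everything here is an illustrative assumption rather than the paper's construction: linear Q-functions over a hypothetical feature map `phi`, a finite action set, and plain squared-Bellman-residual minimization by gradient descent. Note that the paper's novel error criterion is a modified Bellman residual (involving an auxiliary function) precisely because the plain empirical residual used below is biased when only one sample path is available.

```python
import numpy as np

def greedy_action(theta, phi, s, actions):
    """Action maximizing the linear Q-estimate Q(s, a) = theta . phi(s, a)."""
    return max(actions, key=lambda a: theta @ phi(s, a))

def fit_q(theta_policy, phi, data, actions, gamma, d, lr=0.05, epochs=100):
    """Fit Q^pi for the greedy policy induced by theta_policy by minimizing
    the plain empirical squared Bellman residual over the trajectory.
    (Illustrative only: the paper minimizes a bias-corrected residual.)"""
    theta = np.zeros(d)
    for _ in range(epochs):
        for s, a, r, s_next in data:
            a_next = greedy_action(theta_policy, phi, s_next, actions)
            # Bellman residual of the current estimate at this transition.
            delta = theta @ phi(s, a) - (r + gamma * theta @ phi(s_next, a_next))
            # Full gradient of 0.5 * delta**2 with respect to theta.
            theta -= lr * delta * (phi(s, a) - gamma * phi(s_next, a_next))
    return theta

def fitted_policy_iteration(phi, data, actions, gamma, d, n_iters=5):
    """Alternate policy evaluation (residual fitting) and greedy improvement."""
    theta = np.zeros(d)
    for _ in range(n_iters):
        theta = fit_q(theta, phi, data, actions, gamma, d)
    return theta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    actions, d = [0, 1], 3

    def phi(s, a):
        # Toy feature map on a 1-D state space; purely illustrative.
        return np.array([1.0, s, s * float(a)])

    # Fake "single sample path" from a random behaviour policy:
    # the state drifts toward the chosen action, reward favours the origin.
    data, s = [], 0.0
    for _ in range(500):
        a = int(rng.integers(2))
        s_next = 0.9 * s + (0.1 if a == 1 else -0.1) + 0.01 * rng.standard_normal()
        data.append((s, a, -abs(s_next), s_next))
        s = s_next

    theta = fitted_policy_iteration(phi, data, actions, gamma=0.95, d=d)
    print("learned weights:", theta)
```

The outer loop mirrors policy iteration: the policy is never represented explicitly but is recovered greedily from the previous Q-iterate, which is the sense in which the training data from one behaviour policy is reused across all iterations.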
Cite
Text
Antos et al. "Learning Near-Optimal Policies with Bellman-Residual Minimization Based Fitted Policy Iteration and a Single Sample Path." Annual Conference on Computational Learning Theory, 2006. doi:10.1007/11776420_42

Markdown
[Antos et al. "Learning Near-Optimal Policies with Bellman-Residual Minimization Based Fitted Policy Iteration and a Single Sample Path." Annual Conference on Computational Learning Theory, 2006.](https://mlanthology.org/colt/2006/antos2006colt-learning/) doi:10.1007/11776420_42

BibTeX
@inproceedings{antos2006colt-learning,
title = {{Learning Near-Optimal Policies with Bellman-Residual Minimization Based Fitted Policy Iteration and a Single Sample Path}},
author = {Antos, András and Szepesvári, Csaba and Munos, Rémi},
booktitle = {Annual Conference on Computational Learning Theory},
year = {2006},
pages = {574--588},
doi = {10.1007/11776420_42},
url = {https://mlanthology.org/colt/2006/antos2006colt-learning/}
}