Learning Near-Optimal Policies with Bellman-Residual Minimization Based Fitted Policy Iteration and a Single Sample Path

Antos, András; Szepesvári, Csaba; Munos, Rémi

doi:10.1007/S10994-007-5038-2

Learning Near-Optimal Policies with Bellman-Residual Minimization Based Fitted Policy Iteration and a Single Sample Path

András Antos, Csaba Szepesvári, Rémi Munos

MLJ 2008 pp. 89-129

doi:10.1007/S10994-007-5038-2 /mlj/2008/antos2008mlj-learning/

Abstract

In this paper we consider the problem of finding a near-optimal policy in a continuous space, discounted Markovian Decision Problem (MDP) by employing value-function-based methods when only a single trajectory of a fixed policy is available as the input. We study a policy-iteration algorithm where the iterates are obtained via empirical risk minimization with a risk function that penalizes high magnitudes of the Bellman-residual. Our main result is a finite-sample, high-probability bound on the performance of the computed policy that depends on the mixing rate of the trajectory, the capacity of the function set as measured by a novel capacity concept (the VC -crossing dimension), the approximation power of the function set and the controllability properties of the MDP. Moreover, we prove that when a linear parameterization is used the new algorithm is equivalent to Least-Squares Policy Iteration. To the best of our knowledge this is the first theoretical result for off-policy control learning over continuous state-spaces using a single trajectory.

PDF MLJ Semantic Scholar

Cite

Text

Antos et al. "Learning Near-Optimal Policies with Bellman-Residual Minimization Based Fitted Policy Iteration and a Single Sample Path." Machine Learning, 2008. doi:10.1007/S10994-007-5038-2

Markdown

[Antos et al. "Learning Near-Optimal Policies with Bellman-Residual Minimization Based Fitted Policy Iteration and a Single Sample Path." Machine Learning, 2008.](https://mlanthology.org/mlj/2008/antos2008mlj-learning/) doi:10.1007/S10994-007-5038-2

BibTeX

@article{antos2008mlj-learning,
  title     = {{Learning Near-Optimal Policies with Bellman-Residual Minimization Based Fitted Policy Iteration and a Single Sample Path}},
  author    = {Antos, András and Szepesvári, Csaba and Munos, Rémi},
  journal   = {Machine Learning},
  year      = {2008},
  pages     = {89-129},
  doi       = {10.1007/S10994-007-5038-2},
  volume    = {71},
  url       = {https://mlanthology.org/mlj/2008/antos2008mlj-learning/}
}