Error Bounds for Approximate Policy Iteration

Abstract

In Dynamic Programming, convergence of algorithms such as Value Iteration or Policy Iteration results -in discounted problems -- from a contraction property of the back-up operator, guaranteeing convergence to its fixedpoint. When approximation is considered, known results in Approximate Policy Iteration provide bounds on the closeness to optimality of the approximate value function obtained by successive policy improvement steps as a function of the maximum norm of value determination errors during policy evaluation steps. Unfortunately, such results have limited practical range since most function approximators (such as linear regression) select the best fit in a given class of parameterized functions by minimizing some (weighted) quadratic norm. In this paper, we provide error bounds for Approximate Policy Iteration using quadratic norms, and illustrate those results in the case of feature-based linear function approximation. ICML Proceedings of the Twentieth International Conference on Machine Learning

Cite

Text

Munos. "Error Bounds for Approximate Policy Iteration." International Conference on Machine Learning, 2003.

Markdown

[Munos. "Error Bounds for Approximate Policy Iteration." International Conference on Machine Learning, 2003.](https://mlanthology.org/icml/2003/munos2003icml-error/)

BibTeX

@inproceedings{munos2003icml-error,
  title     = {{Error Bounds for Approximate Policy Iteration}},
  author    = {Munos, Rémi},
  booktitle = {International Conference on Machine Learning},
  year      = {2003},
  pages     = {560-567},
  url       = {https://mlanthology.org/icml/2003/munos2003icml-error/}
}