Least-Squares Policy Iteration: Bias-Variance Trade-Off in Control Problems
Abstract
In the context of large-state-space MDPs with linear value function approximation, we introduce a new approximate version of λ-Policy Iteration (Bertsekas and Ioffe, 1996), a method that generalizes Value Iteration and Policy Iteration. Our approach, called Least-Squares λ-Policy Iteration, generalizes LSPI (Lagoudakis & Parr, 2003), which makes efficient use of training samples compared to classical temporal-difference methods. The motivation of our work is to exploit the λ parameter within the least-squares context, without having to generate new samples at each iteration or to know a model of the MDP. We provide a performance bound that shows the soundness of the algorithm. We show empirically, on a simple chain problem and on the Tetris game, that the λ parameter acts as a bias-variance trade-off that may improve the convergence and the performance of the policy obtained.
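To make the bias-variance role of λ concrete, here is a minimal sketch (not code from the paper) of the λ-return that underlies this trade-off: it blends all n-step returns of an episodic trajectory, with λ = 0 recovering the one-step TD target (biased, low variance) and λ = 1 the Monte Carlo return (unbiased, high variance). The function name and the assumption of an episodic trajectory (terminal value 0) are ours.

```python
def lambda_return(rewards, values, gamma, lam):
    """Lambda-return G^lam for one finite trajectory.

    rewards: [r_1, ..., r_T] received along the trajectory.
    values:  [V(s_1), ..., V(s_T)] bootstrapped estimates of successor
             states (use 0.0 for a terminal state).
    Blends the n-step returns G^(n) = r_1 + ... + gamma^(n-1) r_n + gamma^n V(s_n):
        G^lam = (1 - lam) * sum_{n=1}^{T-1} lam^(n-1) G^(n) + lam^(T-1) G^(T)
    """
    T = len(rewards)
    # Build all n-step returns G^(1), ..., G^(T).
    G, running = [], 0.0
    for n in range(T):
        running += gamma ** n * rewards[n]
        G.append(running + gamma ** (n + 1) * values[n])
    # Geometric (1 - lam) * lam^(n-1) weighting of the first T-1 returns,
    # with the remaining lam^(T-1) mass on the full-length return.
    out = sum((1 - lam) * lam ** n * G[n] for n in range(T - 1))
    out += lam ** (T - 1) * G[-1]
    return out
```

With λ = 0 this returns r_1 + γV(s_1), the TD(0) target; with λ = 1 it returns the undiscounted sum of rewards plus the terminal bootstrap, i.e. the Monte Carlo return for an episodic trajectory; intermediate λ interpolates between the two.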
Cite
Text
Thiery and Scherrer. "Least-Squares Policy Iteration: Bias-Variance Trade-Off in Control Problems." International Conference on Machine Learning, 2010.
BibTeX
@inproceedings{thiery2010icml-least,
  title     = {{Least-Squares Policy Iteration: Bias-Variance Trade-Off in Control Problems}},
  author    = {Thiery, Christophe and Scherrer, Bruno},
  booktitle = {International Conference on Machine Learning},
  year      = {2010},
  pages     = {1071--1078},
  url       = {https://mlanthology.org/icml/2010/thiery2010icml-least/}
}