Classification-Based Policy Iteration with a Critic

Abstract

In this paper, we study the effect of adding a value function approximation component (critic) to rollout classification-based policy iteration (RCPI) algorithms. The idea is to use a critic to approximate the return after we truncate the rollout trajectories. This allows us to control the bias and variance of the rollout estimates of the action-value function. Therefore, the introduction of a critic can improve the accuracy of the rollout estimates, and as a result, enhance the performance of the RCPI algorithm. We present a new RCPI algorithm, called direct policy iteration with critic (DPI-Critic), and provide its finite-sample analysis when the critic is based on the LSTD method. We empirically evaluate the performance of DPI-Critic and compare it with DPI and LSPI in two benchmark reinforcement learning problems.
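The core estimator the abstract describes — truncating each rollout after a fixed horizon and letting a critic's value estimate stand in for the discounted tail of the return — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names (`env_step`, `reward_fn`, `critic_v`, `policy`) and the plain Monte-Carlo averaging are assumptions, and the critic here is an arbitrary callable rather than an LSTD fit.

```python
def truncated_rollout_q(env_step, reward_fn, critic_v, policy,
                        state, action, horizon, n_rollouts, gamma):
    """Estimate Q(state, action) by averaging truncated rollouts.

    Each rollout accumulates discounted rewards for `horizon` steps under
    `policy`, then bootstraps with the critic's value at the truncation
    state. A short horizon trades rollout variance for critic bias.
    (Hypothetical interface: env_step(s, a) -> next state,
    reward_fn(s, a) -> reward, critic_v(s) -> value estimate,
    policy(s) -> action.)
    """
    total = 0.0
    for _ in range(n_rollouts):
        s, a, ret, disc = state, action, 0.0, 1.0
        for _ in range(horizon):
            ret += disc * reward_fn(s, a)
            s = env_step(s, a)
            a = policy(s)
            disc *= gamma
        # Critic replaces the discounted return beyond the truncation point.
        ret += disc * critic_v(s)
        total += ret
    return total / n_rollouts
```

Setting `critic_v` to the zero function recovers the purely rollout-based estimate used by DPI; a more accurate critic lets the horizon be shortened without inflating bias.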

Cite

Text

Gabillon et al. "Classification-Based Policy Iteration with a Critic." International Conference on Machine Learning, 2011.

Markdown

[Gabillon et al. "Classification-Based Policy Iteration with a Critic." International Conference on Machine Learning, 2011.](https://mlanthology.org/icml/2011/gabillon2011icml-classification/)

BibTeX

@inproceedings{gabillon2011icml-classification,
  title     = {{Classification-Based Policy Iteration with a Critic}},
  author    = {Gabillon, Victor and Lazaric, Alessandro and Ghavamzadeh, Mohammad and Scherrer, Bruno},
  booktitle = {International Conference on Machine Learning},
  year      = {2011},
  pages     = {1049--1056},
  url       = {https://mlanthology.org/icml/2011/gabillon2011icml-classification/}
}