Toward Off-Policy Learning Control with Function Approximation

Abstract

We present the first temporal-difference learning algorithm for off-policy control with unrestricted linear function approximation whose per-time-step complexity is linear in the number of features. Our algorithm, Greedy-GQ, is an extension of recent work on gradient temporal-difference learning, which has hitherto been restricted to a prediction (policy evaluation) setting, to a control setting in which the target policy is greedy with respect to a linear approximation to the optimal action-value function. A limitation of our control setting is that we require the behavior policy to be stationary. We call this setting latent learning because the optimal policy, though learned, is not manifest in behavior. Popular off-policy algorithms such as Q-learning are known to be unstable in this setting when used with linear function approximation.
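The update described in the abstract — a gradient-TD step with a greedy target policy and a secondary weight vector for stability — can be sketched as follows. This is a minimal illustration, not the paper's reference implementation; the function name `greedy_gq_update`, the feature layout, and all variable names are my own assumptions.

```python
import numpy as np

def greedy_gq_update(theta, w, phi, r, next_phis, alpha, beta, gamma):
    """One (hypothetical) Greedy-GQ step with linear function approximation.

    theta     : main weight vector approximating Q(s, a) = theta . phi(s, a)
    w         : secondary weights used by the gradient-TD correction
    phi       : feature vector phi(s_t, a_t) of the current state-action pair
    next_phis : array of shape (num_actions, num_features), one feature
                vector per action in the next state
    """
    # Target policy is greedy: pick the next-state action maximizing theta . phi.
    q_next = next_phis @ theta
    phi_bar = next_phis[np.argmax(q_next)]
    # TD error toward the greedy target.
    delta = r + gamma * (phi_bar @ theta) - (phi @ theta)
    # Main update: standard TD step plus a gradient correction term
    # involving the secondary weights w (this is what keeps the method
    # stable off-policy, unlike plain Q-learning).
    theta = theta + alpha * (delta * phi - gamma * (w @ phi) * phi_bar)
    # Secondary update: w tracks the expected TD error along phi.
    w = w + beta * (delta - w @ phi) * phi
    return theta, w
```

Per-step cost is linear in the number of features, matching the complexity claim in the abstract: each line is a dot product or scaled vector addition over the feature vector.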

Cite

Text

Maei et al. "Toward Off-Policy Learning Control with Function Approximation." International Conference on Machine Learning, 2010.

Markdown

[Maei et al. "Toward Off-Policy Learning Control with Function Approximation." International Conference on Machine Learning, 2010.](https://mlanthology.org/icml/2010/maei2010icml-off/)

BibTeX

@inproceedings{maei2010icml-off,
  title     = {{Toward Off-Policy Learning Control with Function Approximation}},
  author    = {Maei, Hamid Reza and Szepesvári, Csaba and Bhatnagar, Shalabh and Sutton, Richard S.},
  booktitle = {International Conference on Machine Learning},
  year      = {2010},
  pages     = {719-726},
  url       = {https://mlanthology.org/icml/2010/maei2010icml-off/}
}