Regularized Off-Policy TD-Learning

Abstract

We present a novel $l_1$ regularized off-policy convergent TD-learning method (termed RO-TD), which learns sparse representations of value functions with low computational complexity. The algorithmic framework underlying RO-TD integrates two key ideas: off-policy convergent gradient TD methods, such as TDC, and a convex-concave saddle-point formulation of non-smooth convex optimization, which enables first-order solvers and feature selection using online convex regularization. A detailed theoretical and experimental analysis of RO-TD is presented, with a variety of experiments illustrating its off-policy convergence, sparse feature selection capability, and low computational cost.
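
To make the high-level description concrete, below is a minimal illustrative sketch (Python/NumPy) of an off-policy TDC-style update combined with an $l_1$ proximal (soft-thresholding) step. This is a simplification under stated assumptions: the actual RO-TD algorithm solves the convex-concave saddle-point formulation described in the paper rather than applying a plain proximal step, and the function names, step sizes, and importance-sampling handling here are illustrative only.

```python
import numpy as np

def soft_threshold(x, tau):
    """Proximal operator of the l1 norm: shrink each coordinate toward zero."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def tdc_l1_update(theta, w, phi, phi_next, reward, rho,
                  gamma=0.99, alpha=0.01, beta=0.05, lam=1e-3):
    """
    One off-policy TDC-style update followed by an l1 soft-thresholding step.
    (Illustrative sketch only; RO-TD itself uses a saddle-point formulation.)

    theta    : primary weights (linear value-function parameters)
    w        : auxiliary weights used by TDC's gradient-correction term
    phi      : feature vector of the current state
    phi_next : feature vector of the next state
    reward   : observed reward
    rho      : importance-sampling ratio pi(a|s) / mu(a|s) for off-policy data
    """
    delta = reward + gamma * phi_next.dot(theta) - phi.dot(theta)  # TD error
    # TDC main update with gradient-correction term
    theta = theta + alpha * rho * (delta * phi - gamma * phi_next * phi.dot(w))
    # l1 regularization via soft-thresholding (proximal step)
    theta = soft_threshold(theta, alpha * lam)
    # Auxiliary weights track the projection of the TD error onto the features
    w = w + beta * rho * (delta - phi.dot(w)) * phi
    return theta, w
```

The auxiliary vector `w` is what distinguishes gradient-TD methods such as TDC from ordinary TD(0) and underlies their off-policy convergence guarantees, while the soft-thresholding step drives coordinates of `theta` to exactly zero, which is how sparse feature selection arises.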

Cite

Text

Liu et al. "Regularized Off-Policy TD-Learning." Neural Information Processing Systems, 2012.

Markdown

[Liu et al. "Regularized Off-Policy TD-Learning." Neural Information Processing Systems, 2012.](https://mlanthology.org/neurips/2012/liu2012neurips-regularized/)

BibTeX

@inproceedings{liu2012neurips-regularized,
  title     = {{Regularized Off-Policy TD-Learning}},
  author    = {Liu, Bo and Mahadevan, Sridhar and Liu, Ji},
  booktitle = {Neural Information Processing Systems},
  year      = {2012},
  pages     = {836--844},
  url       = {https://mlanthology.org/neurips/2012/liu2012neurips-regularized/}
}