Generalized Emphatic Temporal Difference Learning: Bias-Variance Analysis

Abstract

We consider the off-policy evaluation problem in Markov decision processes with function approximation. We propose a generalization of the recently introduced emphatic temporal differences (ETD) algorithm, which encompasses the original ETD(λ), as well as several other off-policy evaluation algorithms, as special cases. We call this framework ETD(λ, β), where our introduced parameter β controls the decay rate of an importance-sampling term. We study conditions under which the projected fixed-point equation underlying ETD(λ, β) involves a contraction operator, allowing us to present the first asymptotic error bounds (bias) for ETD(λ, β). Our results show that the original ETD algorithm always involves a contraction operator, and its bias is bounded. Moreover, by controlling β, our proposed generalization allows trading off bias for variance reduction, thereby achieving a lower total error.
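The abstract only describes β's role in words; the sketch below is an illustrative, emphatic-TD-style update with linear value features, assuming a uniform interest of 1, and is not the authors' exact pseudocode. The follow-on trace `F` accumulates past importance-sampling ratios and decays at rate β, so β = γ corresponds to the original ETD(λ) weighting; all names and default values here are illustrative.

```python
import numpy as np

def etd_lambda_beta_step(theta, e, F, phi_t, phi_tp1, reward, rho, rho_prev,
                         gamma=0.99, lam=0.9, beta=0.5, alpha=0.01, interest=1.0):
    """One ETD(lambda, beta)-style update with linear features (illustrative sketch).

    theta    : weight vector (d,)
    e        : eligibility trace (d,)
    F        : scalar follow-on trace
    phi_t    : feature vector of the current state (d,)
    phi_tp1  : feature vector of the next state (d,)
    rho      : importance-sampling ratio pi(a_t | s_t) / mu(a_t | s_t)
    rho_prev : importance-sampling ratio from the previous step
    beta     : decay rate of the follow-on (importance-sampling) trace
    """
    # Follow-on trace: beta controls how quickly past importance weights decay.
    F = interest + beta * rho_prev * F
    # Emphasis mixes the immediate interest with the accumulated follow-on trace.
    M = lam * interest + (1.0 - lam) * F
    # Accumulating eligibility trace, scaled by the emphasis and the IS ratio.
    e = rho * (gamma * lam * e + M * phi_t)
    # Standard TD error under linear function approximation.
    delta = reward + gamma * theta @ phi_tp1 - theta @ phi_t
    theta = theta + alpha * delta * e
    return theta, e, F
```

Under this sketch, choosing β = γ recovers the ETD(λ) weighting referred to in the abstract, while a smaller β shortens the effective horizon of the accumulated importance-sampling product, reducing the variance of the follow-on trace at the cost of added bias, which is the bias-variance trade-off the paper analyzes.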

Cite

Text

Hallak et al. "Generalized Emphatic Temporal Difference Learning: Bias-Variance Analysis." AAAI Conference on Artificial Intelligence, 2016. doi:10.1609/AAAI.V30I1.10227

Markdown

[Hallak et al. "Generalized Emphatic Temporal Difference Learning: Bias-Variance Analysis." AAAI Conference on Artificial Intelligence, 2016.](https://mlanthology.org/aaai/2016/hallak2016aaai-generalized/) doi:10.1609/AAAI.V30I1.10227

BibTeX

@inproceedings{hallak2016aaai-generalized,
  title     = {{Generalized Emphatic Temporal Difference Learning: Bias-Variance Analysis}},
  author    = {Hallak, Assaf and Tamar, Aviv and Munos, Rémi and Mannor, Shie},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2016},
  pages     = {1631--1637},
  doi       = {10.1609/AAAI.V30I1.10227},
  url       = {https://mlanthology.org/aaai/2016/hallak2016aaai-generalized/}
}