On Average Versus Discounted Reward Temporal-Difference Learning

Abstract

We provide an analytical comparison between discounted and average reward temporal-difference (TD) learning with linearly parameterized approximations. We first consider the asymptotic behavior of the two algorithms. We show that as the discount factor approaches 1, the value function produced by discounted TD approaches the differential value function generated by average reward TD. We further argue that if the constant function—which is typically used as one of the basis functions in discounted TD—is appropriately scaled, the transient behaviors of the two algorithms are also similar. Our analysis suggests that the computational advantages of average reward TD that have been observed in some prior empirical work may have been caused by inappropriate basis function scaling rather than fundamental differences in problem formulations or algorithms.
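The asymptotic relationship the abstract describes can be checked numerically on a small example. The sketch below uses a hypothetical 3-state Markov reward process (the numbers are illustrative, not from the paper) and computes exact quantities by linear algebra rather than running TD itself: the discounted value function decomposes as the average reward scaled by 1/(1-γ) plus the differential value function, up to a term that vanishes as γ→1. The mu/(1-γ) term is precisely what the constant basis function must absorb in discounted TD, which is consistent with the abstract's point about scaling that basis function appropriately.

```python
import numpy as np

# Hypothetical 3-state Markov reward process (illustrative numbers only).
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
r = np.array([1.0, 0.0, 2.0])

# Stationary distribution pi: solve pi P = pi with sum(pi) = 1.
A = np.vstack([P.T - np.eye(3), np.ones(3)])
b = np.array([0.0, 0.0, 0.0, 1.0])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)

mu = pi @ r  # average reward (gain)

# Differential value function h: (I - P) h = r - mu, normalized so pi @ h = 0.
# The system is consistent because pi @ (r - mu) = 0.
Ah = np.vstack([np.eye(3) - P, pi])
bh = np.append(r - mu, 0.0)
h, *_ = np.linalg.lstsq(Ah, bh, rcond=None)

for gamma in [0.9, 0.99, 0.999]:
    # Discounted value function V = (I - gamma P)^{-1} r.
    V = np.linalg.solve(np.eye(3) - gamma * P, r)
    # Laurent expansion: V = mu/(1-gamma) + h + O(1-gamma),
    # so this residual shrinks as gamma approaches 1.
    residual = V - mu / (1 - gamma) - h
    print(f"gamma={gamma}: max residual {np.max(np.abs(residual)):.4f}")
```

Subtracting `mu/(1-gamma)` centers the discounted values, and what remains converges to the differential value function produced by the average reward formulation; this is the exact-solution counterpart of the algorithmic comparison in the paper.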

Cite

Text

Tsitsiklis and Van Roy. "On Average Versus Discounted Reward Temporal-Difference Learning." Machine Learning, 2002. doi:10.1023/A:1017980312899

Markdown

[Tsitsiklis and Van Roy. "On Average Versus Discounted Reward Temporal-Difference Learning." Machine Learning, 2002.](https://mlanthology.org/mlj/2002/tsitsiklis2002mlj-average/) doi:10.1023/A:1017980312899

BibTeX

@article{tsitsiklis2002mlj-average,
  title     = {{On Average Versus Discounted Reward Temporal-Difference Learning}},
  author    = {Tsitsiklis, John N. and Van Roy, Benjamin},
  journal   = {Machine Learning},
  year      = {2002},
  pages     = {179--191},
  doi       = {10.1023/A:1017980312899},
  volume    = {49},
  url       = {https://mlanthology.org/mlj/2002/tsitsiklis2002mlj-average/}
}