A Reinforcement Learning Method for Maximizing Undiscounted Rewards

Abstract

While most Reinforcement Learning work utilizes temporal discounting to evaluate performance, the reasons for this are unclear. Is it out of desire or necessity? We argue that it is not out of desire, and seek to dispel the notion that temporal discounting is necessary by proposing a framework for undiscounted optimization. We present a metric of undiscounted performance and an algorithm for finding action policies that maximize that measure. The technique, which we call R-learning, is modelled after the popular Q-learning algorithm [17]. Initial experimental results are presented which attest to a great improvement over Q-learning in some simple cases.
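The R-learning update described in the paper replaces Q-learning's discounted return with relative action values R(s, a) and a running estimate ρ of the average (undiscounted) reward per step. The following is a minimal tabular sketch of that idea on a deterministic toy MDP; the function name, data-structure layout, and hyperparameters are illustrative assumptions, not the paper's notation or experimental setup.

```python
import random

def r_learning(transitions, rewards, states, actions,
               steps=5000, beta=0.1, alpha=0.01, epsilon=0.1, seed=0):
    """Sketch of tabular R-learning (after Schwartz, 1993).

    Learns relative action values R(s, a) plus an average-reward
    estimate rho, instead of discounted values as in Q-learning.
    `transitions[(s, a)]` gives the next state and `rewards[(s, a)]`
    the immediate reward (deterministic toy MDP for brevity).
    """
    rng = random.Random(seed)
    R = {(s, a): 0.0 for s in states for a in actions}
    rho = 0.0
    s = states[0]
    for _ in range(steps):
        # epsilon-greedy action selection
        greedy_a = max(actions, key=lambda a_: R[(s, a_)])
        a = rng.choice(actions) if rng.random() < epsilon else greedy_a
        r = rewards[(s, a)]
        s2 = transitions[(s, a)]
        best_next = max(R[(s2, a_)] for a_ in actions)
        # update rho only on greedy (non-exploratory) steps,
        # using the pre-update value estimates
        if a == greedy_a:
            rho += alpha * (r + best_next - R[(s, greedy_a)] - rho)
        # undiscounted TD error with the average reward subtracted out
        R[(s, a)] += beta * (r - rho + best_next - R[(s, a)])
        s = s2
    return R, rho
```

Note the key design difference from Q-learning: no discount factor appears; instead the per-step average reward ρ is subtracted from each reward, so R(s, a) measures transient advantage relative to the long-run average rather than a discounted sum.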

Cite

Text

Anton Schwartz. "A Reinforcement Learning Method for Maximizing Undiscounted Rewards." International Conference on Machine Learning, 1993, pp. 298-305. doi:10.1016/B978-1-55860-307-3.50045-9

Markdown

[Anton Schwartz. "A Reinforcement Learning Method for Maximizing Undiscounted Rewards." International Conference on Machine Learning, 1993, pp. 298-305.](https://mlanthology.org/icml/1993/schwartz1993icml-reinforcement/) doi:10.1016/B978-1-55860-307-3.50045-9

BibTeX

@inproceedings{schwartz1993icml-reinforcement,
  title     = {{A Reinforcement Learning Method for Maximizing Undiscounted Rewards}},
  author    = {Schwartz, Anton},
  booktitle = {International Conference on Machine Learning},
  year      = {1993},
  pages     = {298-305},
  doi       = {10.1016/B978-1-55860-307-3.50045-9},
  url       = {https://mlanthology.org/icml/1993/schwartz1993icml-reinforcement/}
}