A Reinforcement Learning Method for Maximizing Undiscounted Rewards
Abstract
While most Reinforcement Learning work uses temporal discounting to evaluate performance, the reasons for this are unclear. Is it out of desire or necessity? We argue that it is not out of desire, and seek to dispel the notion that temporal discounting is necessary by proposing a framework for undiscounted optimization. We present a metric of undiscounted performance and an algorithm for finding action policies that maximize that measure. The technique, which we call R-learning, is modelled after the popular Q-learning algorithm [17]. Initial experimental results are presented which show a substantial improvement over Q-learning in some simple cases.
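The abstract's core idea can be illustrated with a small sketch of an R-learning-style update rule for the average-reward (undiscounted) setting, in the spirit of Schwartz (1993): R-values are learned against rewards measured relative to a running average-reward estimate ρ, which is itself adjusted only on greedy steps. The two-state MDP, learning rates, and step counts below are assumptions chosen for illustration, not taken from the paper.

```python
# Illustrative sketch of an R-learning-style update (average-reward RL).
# Hyperparameters and the toy MDP are assumptions for this example only.
import random

def r_learning(transitions, n_states, n_actions, steps=5000,
               beta=0.1, alpha=0.01, epsilon=0.1, seed=0):
    """transitions[s][a] -> (next_state, reward), deterministic."""
    rng = random.Random(seed)
    R = [[0.0] * n_actions for _ in range(n_states)]  # R-values R(s, a)
    rho = 0.0  # running estimate of the average reward per step
    s = 0
    for _ in range(steps):
        greedy = max(range(n_actions), key=lambda x: R[s][x])
        a = rng.randrange(n_actions) if rng.random() < epsilon else greedy
        s2, r = transitions[s][a]
        best_next = max(R[s2])
        if a == greedy:
            # rho tracks r + max_a' R(s',a') - max_a R(s,a),
            # updated only on non-exploratory actions
            rho += alpha * (r + best_next - R[s][greedy] - rho)
        # R-value update: reward is measured relative to the average rho
        R[s][a] += beta * (r - rho + best_next - R[s][a])
        s = s2
    return R, rho

# Toy MDP: action 0 "stays" for reward 1 per step; action 1 traverses a
# two-state cycle paying 0 then 4 (average reward 2, better than staying).
mdp = [
    [(0, 1.0), (1, 0.0)],  # state 0: (next_state, reward) per action
    [(1, 1.0), (0, 4.0)],  # state 1
]
R, rho = r_learning(mdp, n_states=2, n_actions=2)
```

On this toy problem the learned ρ approaches the optimal average reward of 2 per step, and the greedy policy prefers the cycle over staying put; a discounted learner would need a large enough discount factor to see past the cycle's zero-reward step.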
Cite
Text
Schwartz. "A Reinforcement Learning Method for Maximizing Undiscounted Rewards." International Conference on Machine Learning, 1993. doi:10.1016/B978-1-55860-307-3.50045-9
Markdown
[Schwartz. "A Reinforcement Learning Method for Maximizing Undiscounted Rewards." International Conference on Machine Learning, 1993.](https://mlanthology.org/icml/1993/schwartz1993icml-reinforcement/) doi:10.1016/B978-1-55860-307-3.50045-9
BibTeX
@inproceedings{schwartz1993icml-reinforcement,
title = {{A Reinforcement Learning Method for Maximizing Undiscounted Rewards}},
author = {Schwartz, Anton},
booktitle = {International Conference on Machine Learning},
year = {1993},
pages = {298-305},
doi = {10.1016/B978-1-55860-307-3.50045-9},
url = {https://mlanthology.org/icml/1993/schwartz1993icml-reinforcement/}
}