To Discount or Not to Discount in Reinforcement Learning: A Case Study Comparing R Learning and Q Learning

Abstract

Most work in reinforcement learning (RL) is based on discounted techniques, such as Q learning, where long-term rewards are geometrically attenuated based on the delay in their occurrence. Schwartz recently proposed an undiscounted RL technique called R learning that optimizes average reward, and argued that it was a better metric than the discounted one optimized by Q learning. In this paper we compare R learning with Q learning on a simulated robot box-pushing task. We compare these two techniques across three different exploration strategies: two of them undirected, Boltzmann and semi-uniform, and one recency-based directed strategy. Our results show that Q learning performs better than R learning, even when both are evaluated using the same undiscounted performance measure. Furthermore, R learning appears to be very sensitive to the choice of exploration strategy. In particular, a surprising result is that R learning's performance noticeably deteriorates under Boltzmann exploration. We precisely identify a limit-cycle situation that causes R learning's performance to deteriorate when combined with Boltzmann exploration, and show where such limit cycles arise in our robot task. However, R learning performs much better (although not as well as Q learning) when combined with semi-uniform and recency-based exploration. In this paper, we also argue for using medians over means as a better distribution-free estimator of average performance, and describe a simple non-parametric significance test for comparing learning data from two RL techniques.
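The two update rules contrasted in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function names, tabular state/action encoding, parameter values, and the ordering of the average-reward (`rho`) update relative to the value update are all assumptions.

```python
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Discounted Q learning: future rewards are geometrically
    attenuated by the discount factor gamma.
    Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
    """
    target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])


def r_update(R, rho, s, a, r, s_next, beta=0.1, alpha_rho=0.01, greedy=True):
    """Undiscounted R learning (Schwartz): values are relative to the
    running estimate of the average reward rho, with no discounting.
    R(s,a) <- R(s,a) + beta * [r - rho + max_a' R(s',a') - R(s,a)]
    rho is adjusted only on greedy (non-exploratory) steps.
    Returns the updated rho.
    """
    if greedy:
        rho += alpha_rho * (r - rho + max(R[s_next]) - max(R[s]))
    R[s][a] += beta * (r - rho + max(R[s_next]) - R[s][a])
    return rho


# Toy two-state, two-action example: one transition from state 0 to
# state 1 with reward 1.0, applied to zero-initialized tables.
Q = [[0.0, 0.0], [0.0, 0.0]]
q_update(Q, s=0, a=0, r=1.0, s_next=1)

R = [[0.0, 0.0], [0.0, 0.0]]
rho = r_update(R, rho=0.0, s=0, a=0, r=1.0, s_next=1)
```

Note the structural difference: Q learning shrinks delayed rewards via `gamma`, while R learning instead subtracts the average reward `rho`, so all delays are weighted equally in the long run.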

Cite

Text

Mahadevan. "To Discount or Not to Discount in Reinforcement Learning: A Case Study Comparing R Learning and Q Learning." International Conference on Machine Learning, 1994. doi:10.1016/B978-1-55860-335-6.50028-3

Markdown

[Mahadevan. "To Discount or Not to Discount in Reinforcement Learning: A Case Study Comparing R Learning and Q Learning." International Conference on Machine Learning, 1994.](https://mlanthology.org/icml/1994/mahadevan1994icml-discount/) doi:10.1016/B978-1-55860-335-6.50028-3

BibTeX

@inproceedings{mahadevan1994icml-discount,
  title     = {{To Discount or Not to Discount in Reinforcement Learning: A Case Study Comparing R Learning and Q Learning}},
  author    = {Mahadevan, Sridhar},
  booktitle = {International Conference on Machine Learning},
  year      = {1994},
  pages     = {164--172},
  doi       = {10.1016/B978-1-55860-335-6.50028-3},
  url       = {https://mlanthology.org/icml/1994/mahadevan1994icml-discount/}
}