Average Reward Reinforcement Learning: Foundations, Algorithms, and Empirical Results

Abstract

This paper presents a detailed study of average reward reinforcement learning, an undiscounted optimality framework that is more appropriate for cyclical tasks than the much better-studied discounted framework. A wide spectrum of average reward algorithms is described, ranging from synchronous dynamic programming methods to several (provably convergent) asynchronous algorithms from optimal control and learning automata. A general sensitive discount optimality metric called n-discount-optimality is introduced, and used to compare the various algorithms. The overview identifies a key similarity across several asynchronous algorithms that is crucial to their convergence, namely independent estimation of the average reward and the relative values. The overview also uncovers a surprising limitation shared by the different algorithms: while several algorithms can provably generate gain-optimal policies that maximize average reward, none of them can reliably filter these to produce bias-optimal (or T-optimal) policies that also maximize the finite reward to absorbing goal states. This paper also presents a detailed empirical study of R-learning, an average reward reinforcement learning method, using two empirical testbeds: a stochastic grid world domain and a simulated robot environment. A detailed sensitivity analysis of R-learning is carried out to test its dependence on learning rates and exploration levels. The results suggest that R-learning is quite sensitive to exploration strategies, and can fall into sub-optimal limit cycles. The performance of R-learning is also compared with that of Q-learning, the best-studied discounted RL method. Here, the results suggest that R-learning can be fine-tuned to give better performance than Q-learning in both domains.
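The convergence ingredient highlighted above, estimating the average reward separately from the relative action values, can be illustrated with a minimal tabular R-learning sketch. This is not the paper's experimental code; the two-state cyclical MDP, the parameter values, and the tie-breaking details are all illustrative assumptions. The gain estimate `rho` is subtracted in place of a discount factor and is updated only on greedy (non-exploratory) steps.

```python
import random

# Hypothetical two-state cyclical MDP: transitions[s][a] = (reward, next_state).
# Action 0 = "stay" (reward 1); action 1 = "move". Moving 0 -> 1 pays 0 and
# moving 1 -> 0 pays 5, so the gain-optimal cycle has average reward 2.5.
TRANSITIONS = [
    [(1.0, 0), (0.0, 1)],   # state 0
    [(1.0, 1), (5.0, 0)],   # state 1
]

def r_learning(transitions, steps=30000, alpha=0.05, beta=0.1,
               epsilon=0.1, seed=0):
    """Tabular R-learning sketch: learns relative action values R(s, a)
    and, independently, an estimate rho of the average reward (gain)."""
    rng = random.Random(seed)
    n_actions = len(transitions[0])
    R = [[0.0] * n_actions for _ in range(len(transitions))]
    rho, s = 0.0, 0
    for _ in range(steps):
        v_s = max(R[s])  # current relative value of s, before any update
        greedy_a = max(range(n_actions), key=lambda a: R[s][a])
        a = rng.randrange(n_actions) if rng.random() < epsilon else greedy_a
        was_greedy = R[s][a] == v_s
        reward, s2 = transitions[s][a]
        # Relative-value update: rho is subtracted instead of discounting.
        R[s][a] += beta * (reward - rho + max(R[s2]) - R[s][a])
        # Average reward is estimated separately, on greedy steps only.
        if was_greedy:
            rho += alpha * (reward - rho + max(R[s2]) - v_s)
        s = s2
    return R, rho
```

On this toy MDP the gain estimate `rho` settles near the optimal average reward of 2.5, and the greedy policy recovers the 0 → 1 → 0 cycle; with less exploration or poorly chosen step sizes, the same update can instead lock into the sub-optimal "stay" cycle, mirroring the limit-cycle sensitivity reported in the abstract.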

Cite

Text

Mahadevan. "Average Reward Reinforcement Learning: Foundations, Algorithms, and Empirical Results." Machine Learning, 1996. doi:10.1023/A:1018064306595

Markdown

[Mahadevan. "Average Reward Reinforcement Learning: Foundations, Algorithms, and Empirical Results." Machine Learning, 1996.](https://mlanthology.org/mlj/1996/mahadevan1996mlj-average/) doi:10.1023/A:1018064306595

BibTeX

@article{mahadevan1996mlj-average,
  title     = {{Average Reward Reinforcement Learning: Foundations, Algorithms, and Empirical Results}},
  author    = {Mahadevan, Sridhar},
  journal   = {Machine Learning},
  year      = {1996},
  pages     = {159--195},
  doi       = {10.1023/A:1018064306595},
  volume    = {22},
  url       = {https://mlanthology.org/mlj/1996/mahadevan1996mlj-average/}
}