Reinforcement Learning by Stochastic Hill Climbing on Discounted Reward
Abstract
Reinforcement learning systems are often required to find stochastic rather than deterministic policies, and to gain reward while they are still learning. Q-learning was not designed for stochastic policies and does not guarantee rational behavior partway through learning. This paper presents a new reinforcement learning approach, based on a simple credit assignment, for finding memory-less policies. It satisfies the above requirements by treating the policy and the exploration strategy as one and the same. Mathematical analysis shows that the proposed method is a stochastic gradient ascent on the discounted reward in Markov decision processes (MDPs) and is related to the average-reward framework. The analysis also assures that the proposed method can be extended to continuous environments. We further compare its behavior with that of Q-learning on a small MDP example and a non-Markovian one.
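The idea of stochastic gradient ascent on discounted reward with a stochastic, memory-less policy can be illustrated with a generic REINFORCE-style sketch. This is not the paper's exact credit-assignment scheme; the tiny two-state MDP, the softmax parameterization, and all constants below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny episodic MDP for illustration: 2 states, 2 actions.
# Action 1 yields reward 1 in either state, action 0 yields 0;
# the next state is drawn uniformly at random.
N_STATES, N_ACTIONS, GAMMA, HORIZON = 2, 2, 0.9, 10
theta = np.zeros((N_STATES, N_ACTIONS))  # logits of the stochastic policy

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def run_episode():
    """Sample a trajectory from the current stochastic policy."""
    s = rng.integers(N_STATES)
    traj = []
    for _ in range(HORIZON):
        a = rng.choice(N_ACTIONS, p=softmax(theta[s]))
        r = float(a == 1)
        traj.append((s, a, r))
        s = rng.integers(N_STATES)
    return traj

def update(traj, alpha=0.1):
    """Stochastic gradient ascent step on the discounted return."""
    G = 0.0
    # Walk backwards to accumulate the discounted return from each step,
    # then climb along the log-likelihood gradient of the taken action.
    for s, a, r in reversed(traj):
        G = r + GAMMA * G
        p = softmax(theta[s])
        grad = -p
        grad[a] += 1.0            # d/dtheta log pi(a|s)
        theta[s] += alpha * G * grad

for _ in range(2000):
    update(run_episode())

# After training, the policy should put most probability on action 1.
print(softmax(theta[0])[1], softmax(theta[1])[1])
```

Note that the same softmax distribution both defines the policy and drives exploration, which mirrors the abstract's point about treating the two identically.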
Cite
Text
Kimura et al. "Reinforcement Learning by Stochastic Hill Climbing on Discounted Reward." International Conference on Machine Learning, 1995. doi:10.1016/B978-1-55860-377-6.50044-X
Markdown
[Kimura et al. "Reinforcement Learning by Stochastic Hill Climbing on Discounted Reward." International Conference on Machine Learning, 1995.](https://mlanthology.org/icml/1995/kimura1995icml-reinforcement/) doi:10.1016/B978-1-55860-377-6.50044-X
BibTeX
@inproceedings{kimura1995icml-reinforcement,
title = {{Reinforcement Learning by Stochastic Hill Climbing on Discounted Reward}},
author = {Kimura, Hajime and Yamamura, Masayuki and Kobayashi, Shigenobu},
booktitle = {International Conference on Machine Learning},
year = {1995},
pages = {295-303},
doi = {10.1016/B978-1-55860-377-6.50044-X},
url = {https://mlanthology.org/icml/1995/kimura1995icml-reinforcement/}
}