An Analysis of Direct Reinforcement Learning in Non-Markovian Domains
Abstract
It is well known that for Markov decision processes, the policies stable under policy iteration and the standard reinforcement learning methods are exactly the optimal policies. In this paper, we investigate the conditions for policy stability in the more general situation when the Markov property cannot be assumed. We show that for a general class of non-Markov decision processes, if actual return (Monte Carlo) credit assignment is used with undiscounted returns, we are still guaranteed that the optimal observation-based policies will be equilibrium points in the policy space when using the standard "direct" reinforcement learning approaches. However, if either discounted rewards or a temporal-difference style of credit assignment is used, this is not the case.
1 Introduction
The techniques of reinforcement learning (RL) have been developed to effect autonomous learning in agents interacting with an initially unknown and possibly changing environment. In its simplest formulation,...
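The abstract contrasts actual-return (Monte Carlo) credit assignment with temporal-difference credit assignment, and undiscounted with discounted returns. The following is a minimal illustrative sketch of how those learning targets differ, not the authors' code or experimental setup. The function names, the reward sequence, and the value estimate are all hypothetical and chosen only for illustration.

```python
from typing import Sequence


def monte_carlo_return(rewards: Sequence[float], gamma: float = 1.0) -> float:
    """Actual return from the first step: sum over k of gamma**k * rewards[k].

    With gamma == 1.0 this is the undiscounted return.
    """
    g = 0.0
    # Accumulate backwards so each step applies one factor of gamma.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def td0_target(reward: float, next_value: float, gamma: float = 1.0) -> float:
    """One-step temporal-difference target: reward + gamma * V(next observation).

    Unlike the actual return, this bootstraps from a current value estimate.
    """
    return reward + gamma * next_value


# Hypothetical episode: reward arrives only on the final step.
episode_rewards = [0.0, 0.0, 1.0]

undiscounted = monte_carlo_return(episode_rewards)           # gamma = 1.0
discounted = monte_carlo_return(episode_rewards, gamma=0.5)  # gamma = 0.5

# Hypothetical value estimate attached to the next observation. In a
# non-Markovian domain, one observation may stand for several underlying
# states, so this single number may blend their values.
bootstrapped = td0_target(reward=0.0, next_value=0.2)

print(undiscounted, discounted, bootstrapped)
```

The Monte Carlo target depends only on rewards actually received along the trajectory. The TD target also depends on the value estimate stored for the next observation, and this is the quantity that can become unreliable when observations do not identify the underlying state.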
Cite
Text
Pendrith and McGarity. "An Analysis of Direct Reinforcement Learning in Non-Markovian Domains." International Conference on Machine Learning, 1998.
Markdown
[Pendrith and McGarity. "An Analysis of Direct Reinforcement Learning in Non-Markovian Domains." International Conference on Machine Learning, 1998.](https://mlanthology.org/icml/1998/pendrith1998icml-analysis/)
BibTeX
@inproceedings{pendrith1998icml-analysis,
title = {{An Analysis of Direct Reinforcement Learning in Non-Markovian Domains}},
author = {Pendrith, Mark D. and McGarity, Michael},
booktitle = {International Conference on Machine Learning},
year = {1998},
pages = {421-429},
url = {https://mlanthology.org/icml/1998/pendrith1998icml-analysis/}
}