On Q-Learning Convergence for Non-Markov Decision Processes

Abstract

Temporal-difference (TD) learning is an attractive, computationally efficient framework for model-free reinforcement learning. Q-learning is one of the most widely used TD learning techniques, enabling an agent to learn the optimal action-value function, i.e., the Q-value function. Despite its widespread use, Q-learning has only been proven to converge on Markov Decision Processes (MDPs) and Q-uniform abstractions of finite-state MDPs. On the other hand, most real-world problems are inherently non-Markovian: the full true state of the environment is not revealed by recent observations. In this paper, we investigate the behavior of Q-learning when applied to non-MDP and non-ergodic domains that may have infinitely many underlying states. We prove that the convergence guarantee of Q-learning can be extended to a class of such non-MDP problems, in particular, to some non-stationary domains. We show that state-uniformity of the optimal Q-value function is a necessary and sufficient condition for Q-learning to converge even in the case of infinitely many internal states.
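
For readers less familiar with the algorithm the abstract analyzes, the following is a minimal tabular Q-learning sketch. The toy two-state environment, exploration rate, step size, and discount factor below are hypothetical choices for illustration only and are not the paper's construction; the sketch simply shows the standard update Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)) whose convergence the paper studies beyond the MDP setting.

# Illustrative tabular Q-learning sketch; environment and hyperparameters are hypothetical.
import random
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    # One Q-learning step: move Q(s,a) toward the bootstrapped target r + gamma * max_a' Q(s',a').
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# Toy two-state chain used only to exercise the update rule.
actions = [0, 1]
def step(s, a):
    # Action 1 moves toward the rewarding state 1; action 0 stays put.
    s_next = min(s + a, 1)
    reward = 1.0 if s_next == 1 else 0.0
    return s_next, reward

Q = defaultdict(float)
s = 0
for t in range(10_000):
    # Epsilon-greedy action selection (epsilon = 0.2, chosen arbitrarily).
    if random.random() < 0.2:
        a = random.choice(actions)
    else:
        a = max(actions, key=lambda a2: Q[(s, a2)])
    s_next, r = step(s, a)
    q_learning_update(Q, s, a, r, s_next, actions)
    s = s_next

print({k: round(v, 2) for k, v in Q.items()})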

Cite

Text

Majeed and Hutter. "On Q-Learning Convergence for Non-Markov Decision Processes." International Joint Conference on Artificial Intelligence, 2018. doi:10.24963/IJCAI.2018/353

Markdown

[Majeed and Hutter. "On Q-Learning Convergence for Non-Markov Decision Processes." International Joint Conference on Artificial Intelligence, 2018.](https://mlanthology.org/ijcai/2018/majeed2018ijcai-q/) doi:10.24963/IJCAI.2018/353

BibTeX

@inproceedings{majeed2018ijcai-q,
  title     = {{On Q-Learning Convergence for Non-Markov Decision Processes}},
  author    = {Majeed, Sultan Javed and Hutter, Marcus},
  booktitle = {International Joint Conference on Artificial Intelligence},
  year      = {2018},
  pages     = {2546-2552},
  doi       = {10.24963/IJCAI.2018/353},
  url       = {https://mlanthology.org/ijcai/2018/majeed2018ijcai-q/}
}