Reinforcement Learning for POMDPs Based on Action Values and Stochastic Optimization

Abstract

We present a new, model-free reinforcement learning algorithm for learning to control partially-observable Markov decision processes. The algorithm incorporates ideas from action-value based reinforcement learning approaches, such as Q-Learning, as well as ideas from the stochastic optimization literature. Key to our approach is a new definition of action value, which makes the algorithm theoretically sound for partially-observable settings. We show that special cases of our algorithm can achieve probability one convergence to locally optimal policies in the limit, or probably approximately correct hill-climbing to a locally optimal policy in a finite number of samples.

Cite

Text

Perkins. "Reinforcement Learning for POMDPs Based on Action Values and Stochastic Optimization." AAAI Conference on Artificial Intelligence, 2002. doi:10.5555/777092.777126

Markdown

[Perkins. "Reinforcement Learning for POMDPs Based on Action Values and Stochastic Optimization." AAAI Conference on Artificial Intelligence, 2002.](https://mlanthology.org/aaai/2002/perkins2002aaai-reinforcement/) doi:10.5555/777092.777126

BibTeX

@inproceedings{perkins2002aaai-reinforcement,
  title     = {{Reinforcement Learning for POMDPs Based on Action Values and Stochastic Optimization}},
  author    = {Perkins, Theodore J.},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2002},
  pages     = {199--204},
  doi       = {10.5555/777092.777126},
  url       = {https://mlanthology.org/aaai/2002/perkins2002aaai-reinforcement/}
}