Gradient Descent for General Reinforcement Learning

Abstract

A simple learning rule is derived, the VAPS algorithm, which can be instantiated to generate a wide range of new reinforcement-learning algorithms. These algorithms solve a number of open problems, define several new approaches to reinforcement learning, and unify different approaches to reinforcement learning under a single theory. These algorithms all have guaranteed convergence, and include modifications of several existing algorithms that were known to fail to converge on simple MDPs. These include Q-learning, SARSA, and advantage learning. In addition to these value-based algorithms, it also generates pure policy-search reinforcement-learning algorithms, which learn optimal policies without learning a value function. In addition, it allows policy-search and value-based algorithms to be combined, thus unifying two very different approaches to reinforcement learning into a single Value and Policy Search (VAPS) algorithm. And these algorithms converge for POMDPs without requiring a proper belief state. Simulation results are given, and several areas for future research are discussed.
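The abstract does not reproduce the learning rule itself. As a rough, hedged sketch of what "gradient descent on a single combined objective" can look like in a VAPS-style method, one way to picture it is a per-step error that blends a value-based term with a policy-search term; the symbols below ($\beta$, $\alpha$, $\gamma$, $Q$, $e^{\text{policy}}_t$) are illustrative placeholders, not the paper's exact notation:

$$
e_t \;=\; (1-\beta)\,\underbrace{\tfrac{1}{2}\bigl[r_{t-1} + \gamma\,Q(x_t,u_t) - Q(x_{t-1},u_{t-1})\bigr]^2}_{\text{value-based (SARSA-style) error}}
\;+\; \beta\,\underbrace{e^{\text{policy}}_t}_{\text{policy-search error}},
\qquad
\Delta w \;=\; -\,\alpha\,\frac{\partial}{\partial w}\,\mathrm{E}\!\Bigl[\textstyle\sum_t e_t\Bigr].
$$

Under this reading, $\beta = 0$ would give a purely value-based method, $\beta = 1$ a pure policy search, and intermediate values a combination of the two, with stochastic gradient descent on the expected total error providing the convergence guarantee the abstract refers to.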

Cite

Text

Baird III and Moore. "Gradient Descent for General Reinforcement Learning." Neural Information Processing Systems, 1998.

Markdown

[Baird III and Moore. "Gradient Descent for General Reinforcement Learning." Neural Information Processing Systems, 1998.](https://mlanthology.org/neurips/1998/iii1998neurips-gradient/)

BibTeX

@inproceedings{iii1998neurips-gradient,
  title     = {{Gradient Descent for General Reinforcement Learning}},
  author    = {Baird, III, Leemon C. and Moore, Andrew W.},
  booktitle = {Neural Information Processing Systems},
  year      = {1998},
  pages     = {968--974},
  url       = {https://mlanthology.org/neurips/1998/iii1998neurips-gradient/}
}