Reinforcement Learning with Function Approximation Converges to a Region

Abstract

Many algorithms for approximate reinforcement learning are not known to converge. In fact, there are counterexamples showing that the adjustable weights in some algorithms may oscillate within a region rather than converging to a point. This paper shows that, for two popular algorithms, such oscillation is the worst that can happen: the weights cannot diverge, but instead must converge to a bounded region. The algorithms are SARSA(0) and V(0); the latter algorithm was used in the well-known TD-Gammon program.
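To make the setting concrete, the following is a minimal sketch of SARSA(0) with linear function approximation, the first of the two algorithms the abstract refers to. The toy MDP, the random features, and all parameter values are illustrative assumptions rather than anything from the paper; the sketch only shows the weight-update rule whose iterates the paper proves remain in a bounded region.

```python
import numpy as np

# Minimal sketch of SARSA(0) with linear function approximation on a toy
# 2-state, 2-action MDP. The MDP, features, and constants are assumptions
# made for illustration; only the update rule reflects the algorithm.

rng = np.random.default_rng(0)

n_states, n_actions, n_features = 2, 2, 4
# Fixed random features phi(s, a); any bounded feature vectors would do.
phi = rng.normal(size=(n_states, n_actions, n_features))

# Toy transition probabilities P[s, a, s'] and rewards R[s, a] (assumed).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])

gamma, alpha, epsilon = 0.9, 0.05, 0.1
w = np.zeros(n_features)

def q(s, a, w):
    """Linear action-value estimate Q(s, a) = w . phi(s, a)."""
    return phi[s, a] @ w

def policy(s, w):
    """Epsilon-greedy action selection under the current weights."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax([q(s, a, w) for a in range(n_actions)]))

s = 0
a = policy(s, w)
for t in range(20000):
    s_next = rng.choice(n_states, p=P[s, a])
    a_next = policy(s_next, w)
    # SARSA(0) temporal-difference error and weight update.
    delta = R[s, a] + gamma * q(s_next, a_next, w) - q(s, a, w)
    w += alpha * delta * phi[s, a]
    s, a = s_next, a_next

# The weights may keep oscillating, but (per the paper's result) the
# iterates of this kind of update stay within a bounded region.
print("final weights:", w)
print("weight norm:", np.linalg.norm(w))
```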

Cite

Text

Gordon. "Reinforcement Learning with Function Approximation Converges to a Region." Neural Information Processing Systems, 2000.

Markdown

[Gordon. "Reinforcement Learning with Function Approximation Converges to a Region." Neural Information Processing Systems, 2000.](https://mlanthology.org/neurips/2000/gordon2000neurips-reinforcement/)

BibTeX

@inproceedings{gordon2000neurips-reinforcement,
  title     = {{Reinforcement Learning with Function Approximation Converges to a Region}},
  author    = {Gordon, Geoffrey J.},
  booktitle = {Neural Information Processing Systems},
  year      = {2000},
  pages     = {1040-1046},
  url       = {https://mlanthology.org/neurips/2000/gordon2000neurips-reinforcement/}
}