Convergence of Optimistic and Incremental Q-Learning
Abstract
We show the convergence of two deterministic variants of Q-learning. The first is the widely used optimistic Q-learning, which initializes the Q-values to large initial values and then follows a greedy policy with respect to the Q-values. We show that setting the initial values sufficiently large guarantees convergence to an ε-optimal policy. The second is a new algorithm, incremental Q-learning, which gradually promotes the values of actions that are not taken. We show that incremental Q-learning converges, in the limit, to the optimal policy. Our incremental Q-learning algorithm can be viewed as a derandomization of ε-greedy Q-learning.
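The abstract's description of optimistic Q-learning (large initial Q-values plus a purely greedy policy) can be illustrated with a minimal tabular sketch. The environment interface (`n_states`, `n_actions`, `reset`, `step`) and all parameter values below are assumptions for illustration, not the paper's exact construction; the paper's guarantee additionally depends on how large the initialization is chosen relative to ε and on the learning-rate schedule.

```python
import numpy as np

def optimistic_q_learning(env, n_episodes=500, gamma=0.95, alpha=0.5,
                          optimistic_init=100.0, max_steps=200):
    """Tabular Q-learning with optimistic initialization.

    Exploration comes only from the large initial Q-values: the agent
    always acts greedily, and untried actions look attractive until
    their values are driven down by updates.
    """
    # Assumed environment interface: env.n_states, env.n_actions,
    # env.reset() -> state, env.step(a) -> (next_state, reward, done).
    Q = np.full((env.n_states, env.n_actions), optimistic_init)
    for _ in range(n_episodes):
        s = env.reset()
        for _ in range(max_steps):
            a = int(np.argmax(Q[s]))               # greedy w.r.t. optimistic Q
            s_next, r, done = env.step(a)
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])  # standard Q-learning update
            s = s_next
            if done:
                break
    return Q
```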
Cite
Text
Even-dar and Mansour. "Convergence of Optimistic and Incremental Q-Learning." Neural Information Processing Systems, 2001.
Markdown
[Even-dar and Mansour. "Convergence of Optimistic and Incremental Q-Learning." Neural Information Processing Systems, 2001.](https://mlanthology.org/neurips/2001/evendar2001neurips-convergence/)
BibTeX
@inproceedings{evendar2001neurips-convergence,
title = {{Convergence of Optimistic and Incremental Q-Learning}},
author = {Even-dar, Eyal and Mansour, Yishay},
booktitle = {Neural Information Processing Systems},
year = {2001},
pages = {1499--1506},
url = {https://mlanthology.org/neurips/2001/evendar2001neurips-convergence/}
}