Periodic Q-Learning

Abstract

The use of target networks is a common practice in deep reinforcement learning for stabilizing training; however, theoretical understanding of this technique is still limited. In this paper, we study the so-called periodic Q-learning algorithm (PQ-learning for short), which resembles the technique used in deep Q-learning, for solving infinite-horizon discounted Markov decision processes (DMDPs) in the tabular setting. PQ-learning maintains two separate Q-value estimates: the online estimate and the target estimate. The online estimate follows the standard Q-learning update, while the target estimate is updated periodically. In contrast to standard Q-learning, PQ-learning enjoys a simple finite-time analysis and achieves a better sample complexity for finding an ε-optimal policy. Our result provides a preliminary justification for the effectiveness of utilizing target estimates or networks in Q-learning algorithms.
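The abstract describes the two-estimate structure only at a high level, so the following is a minimal tabular sketch of the idea: an online Q-table bootstraps against a frozen target Q-table, and the target is synchronized with the online estimate every fixed number of steps. The environment interface (reset/step), the ε-greedy exploration rule, and the particular step size and synchronization period are illustrative assumptions, not the schedule analyzed in the paper.

import numpy as np

def periodic_q_learning(env, num_states, num_actions, gamma=0.99,
                        alpha=0.1, period=100, num_steps=10000, eps=0.1):
    # Sketch of tabular PQ-learning under assumed hyperparameters.
    # `env` is assumed to expose `reset() -> state` and
    # `step(action) -> (next_state, reward, done)`.
    q_online = np.zeros((num_states, num_actions))   # online estimate
    q_target = np.zeros((num_states, num_actions))   # target estimate

    state = env.reset()
    for t in range(num_steps):
        # epsilon-greedy exploration based on the online estimate
        if np.random.rand() < eps:
            action = np.random.randint(num_actions)
        else:
            action = int(np.argmax(q_online[state]))

        next_state, reward, done = env.step(action)

        # Online update: standard Q-learning step, except the bootstrap
        # value comes from the frozen target estimate.
        bootstrap = 0.0 if done else np.max(q_target[next_state])
        td_target = reward + gamma * bootstrap
        q_online[state, action] += alpha * (td_target - q_online[state, action])

        # Target update: copy the online estimate every `period` steps.
        if (t + 1) % period == 0:
            q_target = q_online.copy()

        state = env.reset() if done else next_state

    return q_online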

Cite

Text

Lee and He. "Periodic Q-Learning." Proceedings of the 2nd Conference on Learning for Dynamics and Control, 2020.

Markdown

[Lee and He. "Periodic Q-Learning." Proceedings of the 2nd Conference on Learning for Dynamics and Control, 2020.](https://mlanthology.org/l4dc/2020/lee2020l4dc-periodic/)

BibTeX

@inproceedings{lee2020l4dc-periodic,
  title     = {{Periodic Q-Learning}},
  author    = {Lee, Donghwan and He, Niao},
  booktitle = {Proceedings of the 2nd Conference on Learning for Dynamics and Control},
  year      = {2020},
  pages     = {582--598},
  volume    = {120},
  url       = {https://mlanthology.org/l4dc/2020/lee2020l4dc-periodic/}
}