Q-Learning in Continuous Time

Abstract

We study the continuous-time counterpart of Q-learning for reinforcement learning (RL) under the entropy-regularized, exploratory diffusion process formulation introduced by Wang et al. (2020). As the conventional (big) Q-function collapses in continuous time, we consider its first-order approximation and coin the term “(little) q-function”. This function is related to the instantaneous advantage rate function as well as the Hamiltonian. We develop a “q-learning” theory around the q-function that is independent of time discretization. Given a stochastic policy, we jointly characterize the associated q-function and value function by martingale conditions of certain stochastic processes, in both on-policy and off-policy settings. We then apply the theory to devise different actor-critic algorithms for solving underlying RL problems, depending on whether or not the density function of the Gibbs measure generated from the q-function can be computed explicitly. One of our algorithms interprets the well-known Q-learning algorithm SARSA, and another recovers a policy gradient (PG) based continuous-time algorithm proposed in Jia and Zhou (2022b). Finally, we conduct simulation experiments to compare the performance of our algorithms with those of PG-based algorithms in Jia and Zhou (2022b) and time-discretized conventional Q-learning algorithms.
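
To make the first-order approximation concrete, below is a minimal LaTeX sketch of the expansion described in the abstract; the notation ($J^{\pi}$ for the value function, $\Delta t$ for the time-discretization step, $\gamma$ for the temperature of the entropy regularization) is assumed for illustration rather than quoted verbatim from the paper.

% Sketch (assumed notation): the Q-function of the \Delta t-discretized problem
% collapses to the value function as \Delta t \to 0; its first-order term in
% \Delta t defines the little q-function:
\[
  Q^{\pi}_{\Delta t}(t, x, a) \;=\; J^{\pi}(t, x) \;+\; q^{\pi}(t, x, a)\,\Delta t \;+\; o(\Delta t),
  \qquad \Delta t \to 0 .
\]
% The Gibbs measure generated from the q-function is the Boltzmann-type policy
% with temperature \gamma, whose density may or may not be computable in closed form:
\[
  \pi(a \mid t, x) \;\propto\; \exp\!\left\{ \tfrac{1}{\gamma}\, q^{\pi}(t, x, a) \right\}.
\]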

Cite

Text

Jia and Zhou. "Q-Learning in Continuous Time." Journal of Machine Learning Research, 2023.

Markdown

[Jia and Zhou. "Q-Learning in Continuous Time." Journal of Machine Learning Research, 2023.](https://mlanthology.org/jmlr/2023/jia2023jmlr-qlearning/)

BibTeX

@article{jia2023jmlr-qlearning,
  title     = {{Q-Learning in Continuous Time}},
  author    = {Jia, Yanwei and Zhou, Xun Yu},
  journal   = {Journal of Machine Learning Research},
  year      = {2023},
  pages     = {1--61},
  volume    = {24},
  url       = {https://mlanthology.org/jmlr/2023/jia2023jmlr-qlearning/}
}