Combining Policy Gradient and Q-Learning

Abstract

So far in this book, in the context of combining deep learning with reinforcement learning, we have looked at deep Q-learning and its variants in Chapter 6 and at policy gradients in Chapter 7. Neural network training requires many iterations, and Q-learning, being an off-policy approach, lets us reuse transitions multiple times, giving us sample efficiency. However, Q-learning can be unstable at times. Further, it is an indirect way of learning: instead of learning an optimal policy directly, we first learn Q-values and then use these action values to derive the optimal behavior. In Chapter 7, we looked at learning a policy directly, which gives us much better improvement guarantees. However, all the policies we looked at in Chapter 7 were on-policy. We used a policy to interact with the environment and updated the policy weights to increase the probability of good trajectories/actions while reducing the probability of bad ones. The learning had to be carried out on-policy because previous transitions become invalid after an update to the policy weights.
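The sample-efficiency point above can be made concrete with a minimal sketch. The tabular Q-learning loop below (a toy illustration, not the paper's method; the 1-D chain environment and all hyperparameters are invented for this example) stores every transition in a replay buffer and reuses old transitions for updates. This is valid precisely because the Q-learning target `max_a' Q(s', a')` does not depend on the policy that generated the data, whereas an on-policy gradient update would have to discard those transitions after each weight change.

```python
import random
from collections import deque, defaultdict

# Hypothetical 1-D chain environment: states 0..4, actions {-1, +1},
# reward 1.0 for reaching state 4 (episode ends there).
def step(state, action):
    next_state = min(max(state + action, 0), 4)
    reward = 1.0 if next_state == 4 else 0.0
    return next_state, reward, next_state == 4

def q_learning(episodes=200, alpha=0.5, gamma=0.9, replay_steps=8):
    Q = defaultdict(float)          # Q[(state, action)]
    buffer = deque(maxlen=1000)     # experience replay buffer
    actions = (-1, 1)
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = random.choice(actions)       # behavior policy (random)
            s2, r, done = step(s, a)
            buffer.append((s, a, r, s2, done))
            s = s2
            # Off-policy reuse: replay old transitions many times.
            # The target max_a' Q(s', a') is independent of the
            # behavior policy, so stale transitions remain valid.
            batch = random.sample(buffer, min(replay_steps, len(buffer)))
            for bs, ba, br, bs2, bdone in batch:
                target = br if bdone else br + gamma * max(
                    Q[(bs2, a2)] for a2 in actions)
                Q[(bs, ba)] += alpha * (target - Q[(bs, ba)])
    return Q
```

After training, the greedy policy derived from `Q` moves right toward the rewarding state, e.g. `Q[(3, 1)]` (stepping into state 4 for reward 1) exceeds `Q[(3, -1)]`. An on-policy variant of this loop would instead update only from the current transition and clear the buffer after each policy change.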

Cite

Text

O'Donoghue et al. "Combining Policy Gradient and Q-Learning." International Conference on Learning Representations, 2017. doi:10.1007/978-1-4842-6809-4_8

Markdown

[O'Donoghue et al. "Combining Policy Gradient and Q-Learning." International Conference on Learning Representations, 2017.](https://mlanthology.org/iclr/2017/oaposdonoghue2017iclr-combining/) doi:10.1007/978-1-4842-6809-4_8

BibTeX

@inproceedings{oaposdonoghue2017iclr-combining,
  title     = {{Combining Policy Gradient and Q-Learning}},
  author    = {O'Donoghue, Brendan and Munos, Rémi and Kavukcuoglu, Koray and Mnih, Volodymyr},
  booktitle = {International Conference on Learning Representations},
  year      = {2017},
  doi       = {10.1007/978-1-4842-6809-4_8},
  url       = {https://mlanthology.org/iclr/2017/oaposdonoghue2017iclr-combining/}
}