Policy Gradient with Kernel Quadrature

Abstract

Reward evaluation of episodes becomes a bottleneck in a broad range of reinforcement learning tasks. Our aim in this paper is to select a small but representative subset of a large batch of episodes, on which alone we actually compute rewards, enabling more efficient policy gradient iterations. We build a Gaussian process model of discounted returns or rewards to derive a positive definite kernel on the space of episodes, run an "episodic" kernel quadrature method to compress the information of sample episodes, and pass the reduced episodes to the policy network for gradient updates. We present the theoretical background of this procedure as well as its numerical illustrations in MuJoCo tasks.
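
The following is a minimal sketch of the kind of subset selection the abstract describes, not the authors' implementation: it assumes an RBF kernel over hand-crafted episode feature vectors and uses greedy kernel herding as a stand-in for the paper's GP-derived episodic kernel quadrature. Names such as `episode_feats` and the kernel bandwidth are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    # Pairwise squared distances -> RBF Gram matrix (assumed kernel choice).
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_herding(K, m):
    """Greedily pick m indices whose kernel mean embedding approximates
    that of the full batch (a simple surrogate for kernel quadrature)."""
    n = K.shape[0]
    mean_embed = K.mean(axis=1)           # <k(x_i, .), empirical mean>
    selected, sum_sel = [], np.zeros(n)
    for t in range(m):
        # Favor points close to the mean embedding and far from
        # those already selected.
        scores = mean_embed - sum_sel / (t + 1)
        scores[selected] = -np.inf
        i = int(np.argmax(scores))
        selected.append(i)
        sum_sel += K[:, i]
    return selected

# Toy usage: each row summarizes one candidate episode (e.g., discounted
# sums of state features); these names and sizes are purely illustrative.
rng = np.random.default_rng(0)
episode_feats = rng.normal(size=(256, 8))        # 256 candidate episodes
K = rbf_kernel(episode_feats, episode_feats, gamma=0.5)
subset = kernel_herding(K, m=16)                 # 16 representative episodes
# Rewards would then be computed only for `subset`, and the policy
# gradient estimated from these (possibly re-weighted) episodes.
```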

Cite

Text

Hayakawa and Morimura. "Policy Gradient with Kernel Quadrature." Transactions on Machine Learning Research, 2024.

Markdown

[Hayakawa and Morimura. "Policy Gradient with Kernel Quadrature." Transactions on Machine Learning Research, 2024.](https://mlanthology.org/tmlr/2024/hayakawa2024tmlr-policy/)

BibTeX

@article{hayakawa2024tmlr-policy,
  title     = {{Policy Gradient with Kernel Quadrature}},
  author    = {Hayakawa, Satoshi and Morimura, Tetsuro},
  journal   = {Transactions on Machine Learning Research},
  year      = {2024},
  url       = {https://mlanthology.org/tmlr/2024/hayakawa2024tmlr-policy/}
}