Parameter-Based Value Functions

Abstract

Traditional off-policy actor-critic Reinforcement Learning (RL) algorithms learn value functions of a single target policy. However, when value functions are updated to track the learned policy, they forget potentially useful information about old policies. We introduce a class of value functions called Parameter-Based Value Functions (PBVFs) whose inputs include the policy parameters. They can generalize across different policies. PBVFs can evaluate the performance of any policy given a state, a state-action pair, or a distribution over the RL agent's initial states. First, we show how PBVFs yield novel off-policy policy gradient theorems. Then, we derive off-policy actor-critic algorithms based on PBVFs trained by Monte Carlo or Temporal Difference methods. We show how learned PBVFs can zero-shot learn new policies that outperform any policy seen during training. Finally, our algorithms are evaluated on a selection of discrete and continuous control tasks using shallow policies and deep neural networks. Their performance is comparable to state-of-the-art methods.
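To make the core idea concrete, below is a minimal PyTorch sketch (not taken from the paper's code) of a parameter-based state-value function V(s, θ): the critic receives the state concatenated with the flattened policy parameters, so a single network can be queried for any policy. All class and variable names here are hypothetical, and the architecture sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn
from torch.nn.utils import parameters_to_vector

class ParameterBasedValueFunction(nn.Module):
    """Hypothetical sketch of a PBVF critic: V(state, policy_params)."""
    def __init__(self, state_dim, policy_param_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + policy_param_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, policy_params):
        # state: (batch, state_dim); policy_params: (batch, policy_param_dim)
        # The critic conditions on the policy's weights, so it can evaluate
        # any policy (old or new) without retraining a per-policy value function.
        return self.net(torch.cat([state, policy_params], dim=-1))

# Usage sketch: evaluate a small (hypothetical) policy network under the PBVF.
state_dim, action_dim = 8, 2
policy = nn.Sequential(nn.Linear(state_dim, 32), nn.Tanh(), nn.Linear(32, action_dim))
theta = parameters_to_vector(policy.parameters()).detach()  # flatten policy weights
pbvf = ParameterBasedValueFunction(state_dim, theta.numel())

states = torch.randn(4, state_dim)            # a small batch of states
values = pbvf(states, theta.expand(4, -1))    # V(s, theta) for each state in the batch
```

This sketch only covers the state-conditioned case; the paper also considers variants that take a state-action pair or only the initial-state distribution as input, and trains the critic with Monte Carlo or Temporal Difference targets.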

Cite

Text

Faccio et al. "Parameter-Based Value Functions." International Conference on Learning Representations, 2021.

Markdown

[Faccio et al. "Parameter-Based Value Functions." International Conference on Learning Representations, 2021.](https://mlanthology.org/iclr/2021/faccio2021iclr-parameterbased/)

BibTeX

@inproceedings{faccio2021iclr-parameterbased,
  title     = {{Parameter-Based Value Functions}},
  author    = {Faccio, Francesco and Kirsch, Louis and Schmidhuber, Jürgen},
  booktitle = {International Conference on Learning Representations},
  year      = {2021},
  url       = {https://mlanthology.org/iclr/2021/faccio2021iclr-parameterbased/}
}