Policy Optimization via Importance Sampling

Abstract

Policy optimization is an effective reinforcement learning approach to solve continuous control tasks. Recent achievements have shown that alternating online and offline optimization is a successful choice for efficient trajectory reuse. However, deciding when to stop optimizing and collect new trajectories is non-trivial, as it requires accounting for the variance of the objective function estimate. In this paper, we propose a novel, model-free, policy search algorithm, POIS, applicable in both action-based and parameter-based settings. We first derive a high-confidence bound for importance sampling estimation; then we define a surrogate objective function, which is optimized offline whenever a new batch of trajectories is collected. Finally, the algorithm is tested on a selection of continuous control tasks, with both linear and deep policies, and compared with state-of-the-art policy optimization methods.
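
As a point of reference for the estimator named in the abstract, the following is a minimal sketch of per-trajectory importance sampling (notation is generic and not copied from the paper): returns of trajectories collected under a behavioral policy $\pi_{\theta}$ are reweighted by the likelihood ratio of a target policy $\pi_{\theta'}$, yielding an off-policy estimate of the expected return.

$$
\hat{J}_{\mathrm{IS}}(\theta'/\theta) \;=\; \frac{1}{N} \sum_{i=1}^{N} w_{\theta'/\theta}(\tau_i)\, R(\tau_i),
\qquad
w_{\theta'/\theta}(\tau) \;=\; \prod_{t=0}^{T-1} \frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_{\theta}(a_t \mid s_t)}.
$$

Roughly speaking, the surrogate objective mentioned above penalizes this estimate with a term that grows with the dissimilarity between the behavioral and target trajectory distributions, so that offline optimization stays in a region where the importance weights do not inflate the variance of the estimate.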

Cite

Text

Metelli et al. "Policy Optimization via Importance Sampling." Neural Information Processing Systems, 2018.

Markdown

[Metelli et al. "Policy Optimization via Importance Sampling." Neural Information Processing Systems, 2018.](https://mlanthology.org/neurips/2018/metelli2018neurips-policy/)

BibTeX

@inproceedings{metelli2018neurips-policy,
  title     = {{Policy Optimization via Importance Sampling}},
  author    = {Metelli, Alberto Maria and Papini, Matteo and Faccio, Francesco and Restelli, Marcello},
  booktitle = {Neural Information Processing Systems},
  year      = {2018},
  pages     = {5442--5454},
  url       = {https://mlanthology.org/neurips/2018/metelli2018neurips-policy/}
}