Smoothing Policies and Safe Policy Gradients

Abstract

Policy gradient (PG) algorithms are among the best candidates for the much-anticipated applications of reinforcement learning to real-world control tasks, such as robotics. However, the trial-and-error nature of these methods poses safety issues whenever the learning process itself must be performed on a physical system or involves any form of human-computer interaction. In this paper, we address a specific safety formulation, where both goals and dangers are encoded in a scalar reward signal and the learning agent is constrained to never worsen its performance, measured as the expected sum of rewards. By studying actor-only PG from a stochastic optimization perspective, we establish improvement guarantees for a wide class of parametric policies, generalizing existing results on Gaussian policies. This, together with novel upper bounds on the variance of PG estimators, allows us to identify meta-parameter schedules that guarantee monotonic improvement with high probability. The two key meta-parameters are the step size of the parameter updates and the batch size of the gradient estimates. Through a joint, adaptive selection of these meta-parameters, we obtain a PG algorithm with monotonic improvement guarantees.
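The abstract's core mechanism, jointly adapting the step size and the batch size so that each parameter update improves performance with high probability, can be illustrated with a minimal sketch. The toy problem (a one-dimensional Gaussian policy with quadratic reward), the specific shrinkage rule, and all constants below are illustrative assumptions, not the paper's actual schedules:

```python
import numpy as np

def pg_estimate(theta, batch, rng, sigma=1.0):
    """REINFORCE gradient estimate on a toy one-step problem.

    Policy: a ~ N(theta, sigma^2); reward: r(a) = -a^2, so the
    optimal parameter is theta = 0 and the true gradient is -2*theta.
    Returns the sample-mean gradient and its standard error.
    """
    a = rng.normal(theta, sigma, size=batch)
    r = -a ** 2
    # Score function: d/dtheta log pi(a|theta) = (a - theta) / sigma^2
    g = r * (a - theta) / sigma ** 2
    return g.mean(), g.std(ddof=1) / np.sqrt(batch)

def safe_pg(theta=3.0, iters=200, batch=100, max_batch=10_000, seed=0):
    """Illustrative safe PG loop with adaptive step and batch size.

    The update only moves along the portion of the gradient that
    exceeds its estimated error (a crude high-confidence lower bound
    on the true gradient magnitude); when the estimate is too noisy,
    the batch size is enlarged instead of taking a risky step.
    """
    rng = np.random.default_rng(seed)
    for _ in range(iters):
        g_hat, g_err = pg_estimate(theta, batch, rng)
        g_low = max(abs(g_hat) - 2.0 * g_err, 0.0)
        if g_low == 0.0:
            # Gradient not distinguishable from noise: grow the batch.
            batch = min(2 * batch, max_batch)
            continue
        # Conservative step: scale by the reliable fraction of the
        # estimate (stand-in for the paper's smoothness-based rule).
        alpha = 0.1 * g_low / abs(g_hat)
        theta = theta + alpha * g_hat
    return theta
```

Starting from `theta = 3.0`, the loop drifts toward the optimum at `theta = 0`, taking smaller steps (or none at all, enlarging the batch instead) as the gradient estimate becomes noisier relative to its magnitude.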

Cite

Text

Papini et al. "Smoothing Policies and Safe Policy Gradients." Machine Learning, 2022. doi:10.1007/s10994-022-06232-6

Markdown

[Papini et al. "Smoothing Policies and Safe Policy Gradients." Machine Learning, 2022.](https://mlanthology.org/mlj/2022/papini2022mlj-smoothing/) doi:10.1007/s10994-022-06232-6

BibTeX

@article{papini2022mlj-smoothing,
  title     = {{Smoothing Policies and Safe Policy Gradients}},
  author    = {Papini, Matteo and Pirotta, Matteo and Restelli, Marcello},
  journal   = {Machine Learning},
  year      = {2022},
  pages     = {4081--4137},
  doi       = {10.1007/s10994-022-06232-6},
  volume    = {111},
  url       = {https://mlanthology.org/mlj/2022/papini2022mlj-smoothing/}
}