Policy Gradient in Lipschitz Markov Decision Processes

Abstract

This paper shows how Lipschitz continuity properties of Markov Decision Processes can be exploited to safely speed up policy-gradient algorithms. Starting from assumptions on the Lipschitz continuity of the state-transition model, the reward function, and the policies considered in the learning process, we show that both the expected return of a policy and its gradient are Lipschitz continuous w.r.t. the policy parameters. By leveraging these properties, we define policy-parameter updates that guarantee a performance improvement at each iteration. The proposed methods are empirically evaluated and compared to related approaches on different configurations of three popular control scenarios: the linear quadratic regulator, the mass-spring-damper system, and ship-steering control.
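The core idea can be illustrated with a minimal sketch (not the paper's actual algorithm, and with a hand-picked toy objective): when the gradient of the performance measure J is Lipschitz continuous with a known constant L, a gradient-ascent step of size 1/L is guaranteed not to decrease J at any iteration.

```python
# Toy sketch, assuming we know a Lipschitz constant L of grad J.
# For an L-smooth objective, the update theta + (1/L) * grad J(theta)
# increases J by at least ||grad J||^2 / (2L), so improvement is
# guaranteed at every iteration. The quadratic below is illustrative only.
L = 2.0                                   # assumed Lipschitz constant of grad J
J = lambda th: -(th - 3.0) ** 2           # concave toy "expected return"
grad_J = lambda th: -2.0 * (th - 3.0)     # its gradient (Lipschitz with L = 2)

theta = 0.0
returns = [J(theta)]
for _ in range(20):
    theta += (1.0 / L) * grad_J(theta)    # conservative, safe step size
    returns.append(J(theta))

# the sequence of returns is monotonically non-decreasing
assert all(b >= a for a, b in zip(returns, returns[1:]))
print(theta, returns[-1])
```

The price of the guarantee is conservatism: the step 1/L can be far smaller than what a line search would accept, which is why the paper's tighter, Lipschitz-based bounds matter in practice.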

Cite

Text

Pirotta et al. "Policy Gradient in Lipschitz Markov Decision Processes." Machine Learning, 2015. doi:10.1007/s10994-015-5484-1

Markdown

[Pirotta et al. "Policy Gradient in Lipschitz Markov Decision Processes." Machine Learning, 2015.](https://mlanthology.org/mlj/2015/pirotta2015mlj-policy/) doi:10.1007/s10994-015-5484-1

BibTeX

@article{pirotta2015mlj-policy,
  title     = {{Policy Gradient in Lipschitz Markov Decision Processes}},
  author    = {Pirotta, Matteo and Restelli, Marcello and Bascetta, Luca},
  journal   = {Machine Learning},
  year      = {2015},
  pages     = {255--283},
  doi       = {10.1007/s10994-015-5484-1},
  volume    = {100},
  url       = {https://mlanthology.org/mlj/2015/pirotta2015mlj-policy/}
}