On the Convergence Rate of Off-Policy Policy Optimization Methods with Density-Ratio Correction

Abstract

In this paper, we study the convergence properties of off-policy policy optimization algorithms with state-action density ratio correction in the function approximation setting, where the objective function is formulated as a max-max-min problem. We first characterize the bias of the learning objective, and then present two strategies with finite-time convergence guarantees. In our first strategy, we propose an algorithm called P-SREDA with convergence rate $O(\epsilon^{-3})$, whose dependence on $\epsilon$ is optimal. In our second strategy, we design a new off-policy actor-critic-style algorithm named O-SPIM. We prove that O-SPIM converges to a stationary point with total complexity $O(\epsilon^{-4})$, which matches the convergence rate of some recent actor-critic algorithms in the on-policy setting.
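For orientation, the density-ratio-corrected objective referred to in the abstract typically takes a form along the following lines (a schematic sketch based on the general DICE-style off-policy literature, not necessarily this paper's exact formulation). The expected return is rewritten through the state-action density ratio $\tau_{\pi}(s,a) = d^{\pi}(s,a)/d^{D}(s,a)$, where $d^{D}$ is the distribution of the off-policy data:

$$\max_{\pi} \; J(\pi) \;=\; \max_{\pi} \; \mathbb{E}_{(s,a)\sim d^{D}}\!\big[\tau_{\pi}(s,a)\, r(s,a)\big].$$

Since $\tau_{\pi}$ is unknown, it is itself estimated by solving a max-min (saddle-point) subproblem over a ratio model and a dual function class, which produces the overall max-max-min structure mentioned above.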

Cite

Text

Huang and Jiang. "On the Convergence Rate of Off-Policy Policy Optimization Methods with Density-Ratio Correction." Artificial Intelligence and Statistics, 2022.

Markdown

[Huang and Jiang. "On the Convergence Rate of Off-Policy Policy Optimization Methods with Density-Ratio Correction." Artificial Intelligence and Statistics, 2022.](https://mlanthology.org/aistats/2022/huang2022aistats-convergence/)

BibTeX

@inproceedings{huang2022aistats-convergence,
  title     = {{On the Convergence Rate of Off-Policy Policy Optimization Methods with Density-Ratio Correction}},
  author    = {Huang, Jiawei and Jiang, Nan},
  booktitle = {Artificial Intelligence and Statistics},
  year      = {2022},
  pages     = {2658--2705},
  volume    = {151},
  url       = {https://mlanthology.org/aistats/2022/huang2022aistats-convergence/}
}