Sample Complexity of Policy-Based Methods Under Off-Policy Sampling and Linear Function Approximation
Abstract
In this work, we study policy-based methods for solving the reinforcement learning problem, where off-policy sampling and linear function approximation are employed for policy evaluation, and various policy update rules (including natural policy gradient) are considered for policy improvement. To solve the policy evaluation sub-problem in the presence of the deadly triad, we propose a generic algorithmic framework of multi-step TD-learning with generalized importance sampling ratios, which includes two specific algorithms: the $\lambda$-averaged $Q$-trace and the two-sided $Q$-trace. The generic algorithm is single time-scale, has provable finite-sample guarantees, and overcomes the high-variance issue in off-policy learning. As for policy improvement, we provide a universal analysis that establishes geometric convergence of various policy update rules, which leads to an overall $\tilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity.
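To make the policy evaluation template concrete, below is a minimal sketch of one multi-step off-policy TD update with clipped importance sampling ratios and a linear value estimate $Q_\theta(s,a) = \phi(s,a)^\top \theta$. The function and parameter names (`q_trace_style_update`, `c_bar`, `rho_low`, `rho_high`) are illustrative assumptions, not the paper's notation, and the specific $\lambda$-averaged and two-sided $Q$-trace algorithms differ in how the generalized ratios are chosen; this is not the paper's exact update rule.

```python
import numpy as np

def q_trace_style_update(theta, feats, rewards, rhos, gamma, alpha,
                         c_bar=1.0, rho_low=0.0, rho_high=1.0):
    """One multi-step off-policy TD update with clipped importance sampling
    ratios and a linear estimate Q_theta(s, a) = feats @ theta (a sketch).

    feats:   (n + 1, d) array of features phi(s_i, a_i) along a behavior-policy
             trajectory (s_0, a_0), ..., (s_n, a_n).
    rewards: (n,) array of rewards r_i observed after (s_i, a_i).
    rhos:    (n,) array of raw ratios pi(a_i | s_i) / mu(a_i | s_i).

    The clipping thresholds c_bar, rho_low, rho_high play the role of the
    generalized importance sampling ratios; they are hypothetical parameter
    names used here only for illustration.
    """
    n, d = len(rewards), feats.shape[1]
    q = feats @ theta                        # Q_theta(s_i, a_i) along the trajectory
    c = np.minimum(rhos, c_bar)              # clipped trace ratios
    rho = np.clip(rhos, rho_low, rho_high)   # two-sided clipped TD-error ratios
    delta = rho * (rewards + gamma * q[1:] - q[:-1])  # corrected TD errors
    update = np.zeros(d)
    trace = 1.0
    for i in range(n):
        # i-th correction term: gamma^i * (prod of trace ratios up to i-1) * delta_i
        update += (gamma ** i) * trace * delta[i] * feats[0]
        trace *= c[i]
    return theta + alpha * update
```

Clipping the ratios from above bounds the variance of the multi-step off-policy correction, which is the mechanism the abstract refers to when it says the generic algorithm overcomes the high-variance issue in off-policy learning.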
Cite
Text
Chen and Theja Maguluri. "Sample Complexity of Policy-Based Methods Under Off-Policy Sampling and Linear Function Approximation." Artificial Intelligence and Statistics, 2022.
Markdown
[Chen and Theja Maguluri. "Sample Complexity of Policy-Based Methods Under Off-Policy Sampling and Linear Function Approximation." Artificial Intelligence and Statistics, 2022.](https://mlanthology.org/aistats/2022/chen2022aistats-sample/)
BibTeX
@inproceedings{chen2022aistats-sample,
title = {{Sample Complexity of Policy-Based Methods Under Off-Policy Sampling and Linear Function Approximation}},
author = {Chen, Zaiwei and Theja Maguluri, Siva},
booktitle = {Artificial Intelligence and Statistics},
year = {2022},
pages = {11195--11214},
volume = {151},
url = {https://mlanthology.org/aistats/2022/chen2022aistats-sample/}
}