On-Policy Policy Gradient Reinforcement Learning Without On-Policy Sampling

Abstract

On-policy reinforcement learning (RL) algorithms are typically characterized as algorithms that perform policy updates using i.i.d. trajectories collected by the agent's current policy. However, after observing only a finite number of trajectories, such on-policy sampling may produce data that fails to match the expected on-policy data distribution. This *sampling error* leads to high-variance gradient estimates that yield data-inefficient on-policy learning. Recent work in the policy evaluation setting has shown that non-i.i.d., off-policy sampling can produce data with lower sampling error w.r.t. the expected on-policy distribution than on-policy sampling can produce (Zhong et al., 2022). Motivated by this observation, we introduce an adaptive, off-policy sampling method to reduce sampling error during on-policy policy gradient RL training. Our method, Proximal Robust On-Policy Sampling (PROPS), reduces sampling error by collecting data with a *behavior policy* that increases the probability of sampling actions that are under-sampled w.r.t. the current policy. We empirically evaluate PROPS on both continuous-action MuJoCo benchmark tasks and discrete-action tasks, and demonstrate that PROPS (1) decreases sampling error throughout training and (2) improves the data efficiency of on-policy policy gradient algorithms.
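The core idea of the abstract — steering a behavior policy toward actions that are under-sampled relative to the current policy — can be illustrated with a toy sketch. This is not the authors' PROPS implementation (which learns the behavior policy with a proximal update); it is a minimal, hypothetical discrete-action example where the behavior distribution is shifted toward actions whose empirical frequency falls short of the target policy's probabilities, with `strength` as an assumed correction knob.

```python
import numpy as np

def adjusted_behavior_probs(target_probs, counts, strength=1.0):
    """Toy sampling-error correction: upweight actions that are
    under-sampled relative to the target (current) policy.

    target_probs: desired on-policy action distribution.
    counts: how often each action has been sampled so far.
    strength: hypothetical knob for how aggressively to correct.
    """
    target_probs = np.asarray(target_probs, dtype=float)
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    if total == 0:
        # No data yet: sampling on-policy is the best we can do.
        return target_probs
    empirical = counts / total
    # Positive deficit means the action is under-sampled w.r.t. the target.
    deficit = target_probs - empirical
    behavior = np.clip(target_probs + strength * deficit, 1e-8, None)
    return behavior / behavior.sum()

# Action 3 has been under-sampled, so the behavior policy boosts it.
target = [0.25, 0.25, 0.25, 0.25]
counts = [30, 30, 30, 10]
print(adjusted_behavior_probs(target, counts))  # → [0.2 0.2 0.2 0.4]
```

Sampling from the adjusted distribution drives the aggregate empirical distribution back toward the on-policy one, which is the sense in which off-policy sampling can reduce on-policy sampling error.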

Cite

Text

Corrado and Hanna. "On-Policy Policy Gradient Reinforcement Learning Without On-Policy Sampling." Transactions on Machine Learning Research, 2026.

Markdown

[Corrado and Hanna. "On-Policy Policy Gradient Reinforcement Learning Without On-Policy Sampling." Transactions on Machine Learning Research, 2026.](https://mlanthology.org/tmlr/2026/corrado2026tmlr-onpolicy/)

BibTeX

@article{corrado2026tmlr-onpolicy,
  title     = {{On-Policy Policy Gradient Reinforcement Learning Without On-Policy Sampling}},
  author    = {Corrado, Nicholas E. and Hanna, Josiah P.},
  journal   = {Transactions on Machine Learning Research},
  year      = {2026},
  url       = {https://mlanthology.org/tmlr/2026/corrado2026tmlr-onpolicy/}
}