Ranking Policy Gradient
Abstract
Sample inefficiency is a long-standing problem in reinforcement learning (RL). State-of-the-art methods estimate the optimal action values, which usually involves an extensive search over the state-action space and unstable optimization. Towards sample-efficient RL, we propose ranking policy gradient (RPG), a policy gradient method that learns the optimal rank of a set of discrete actions. To accelerate the learning of policy gradient methods, we establish the equivalence between maximizing the lower bound of return and imitating a near-optimal policy without accessing any oracles. These results lead to a general off-policy learning framework that preserves optimality, reduces variance, and improves sample-efficiency. We conduct extensive experiments showing that, when combined with the off-policy learning framework, RPG substantially reduces sample complexity compared to the state-of-the-art.
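The abstract describes RPG only at a high level. Below is a minimal, hypothetical PyTorch sketch of one way a pairwise-ranking policy over discrete actions could be parameterized: each action gets a relative score, and the (unnormalized) probability of an action is taken as the product of pairwise logistic comparisons against all other actions. The names (`ScoreNet`, `rpg_log_prob`), network sizes, and the exact parameterization are illustrative assumptions, not taken from the paper's code.

```python
# Hypothetical sketch of a pairwise-ranking policy for discrete actions.
# Assumption: pi(a|s) is approximated (unnormalized) by the product over
# j != a of sigmoid(lambda_a - lambda_j), where lambda are per-action scores.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScoreNet(nn.Module):
    """Maps a state to one relative score (lambda) per discrete action."""
    def __init__(self, state_dim, num_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state):
        return self.net(state)  # shape: (batch, num_actions)

def rpg_log_prob(scores, action):
    """Surrogate log pi(a|s) from pairwise comparisons:
    sum over j != a of log sigmoid(lambda_a - lambda_j)."""
    lam_a = scores.gather(1, action.unsqueeze(1))  # (batch, 1)
    log_p = F.logsigmoid(lam_a - scores)           # (batch, num_actions)
    mask = torch.ones_like(log_p)
    mask.scatter_(1, action.unsqueeze(1), 0.0)     # drop the j == a term
    return (log_p * mask).sum(dim=1)               # (batch,)

# REINFORCE-style update: weight the ranking log-prob by the return.
policy = ScoreNet(state_dim=4, num_actions=3)
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

states = torch.randn(32, 4)           # placeholder trajectory batch
actions = torch.randint(0, 3, (32,))  # placeholder sampled actions
returns = torch.randn(32)             # placeholder discounted returns

loss = -(rpg_log_prob(policy(states), actions) * returns).mean()
opt.zero_grad()
loss.backward()
opt.step()
```

Note that the pairwise product is not a normalized distribution; it serves here only as a surrogate objective whose gradient pushes the chosen action's score above the others, which is the intuition behind learning an optimal rank rather than optimal action values.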
Cite
Text
Lin and Zhou. "Ranking Policy Gradient." International Conference on Learning Representations, 2020.

Markdown
[Lin and Zhou. "Ranking Policy Gradient." International Conference on Learning Representations, 2020.](https://mlanthology.org/iclr/2020/lin2020iclr-ranking/)

BibTeX
@inproceedings{lin2020iclr-ranking,
  title     = {{Ranking Policy Gradient}},
  author    = {Lin, Kaixiang and Zhou, Jiayu},
  booktitle = {International Conference on Learning Representations},
  year      = {2020},
  url       = {https://mlanthology.org/iclr/2020/lin2020iclr-ranking/}
}