Near Optimal Reward-Free Reinforcement Learning

Abstract

We study the reward-free reinforcement learning framework, which is particularly suitable for batch reinforcement learning and scenarios where one needs policies for multiple reward functions. This framework has two phases: in the exploration phase, the agent collects trajectories by interacting with the environment without using any reward signal; in the planning phase, the agent needs to return a near-optimal policy for arbitrary reward functions. We give a new efficient algorithm, **S**taged **S**ampling + **T**runcated **P**lanning (SSTP), which interacts with the environment for at most $O\left( \frac{S^2A}{\epsilon^2}\mathrm{poly}\log\left(\frac{SAH}{\epsilon}\right) \right)$ episodes in the exploration phase and is guaranteed to output a near-optimal policy for arbitrary reward functions in the planning phase, where $S$ is the size of the state space, $A$ is the size of the action space, $H$ is the planning horizon, and $\epsilon$ is the target accuracy relative to the total reward. Notably, our sample complexity scales only *logarithmically* with $H$, in contrast to all existing results, which scale *polynomially* with $H$. Furthermore, this bound matches the minimax lower bound $\Omega\left(\frac{S^2A}{\epsilon^2}\right)$ up to logarithmic factors. Our results rely on three new techniques: 1) a new sufficient condition for the dataset to support planning for an $\epsilon$-suboptimal policy; 2) a new way to plan efficiently under the proposed condition using soft-truncated planning; 3) constructing an extended MDP to maximize the truncated cumulative reward efficiently.
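To make the two-phase structure concrete, below is a minimal Python sketch of reward-free exploration and planning on a tabular MDP. This is not the paper's SSTP algorithm: exploration here is uniformly random rather than staged, and planning is plain value iteration on the empirical model rather than soft-truncated planning; all function names (`explore`, `plan`, `sample_next_state`) and the toy setup are illustrative assumptions.

```python
import numpy as np

def explore(sample_next_state, S, A, H, num_episodes, rng):
    """Exploration phase: collect transition counts with no reward signal.
    Acts uniformly at random purely for illustration; the paper's staged
    sampling is a more careful exploration scheme."""
    counts = np.zeros((S, A, S))
    for _ in range(num_episodes):
        s = 0  # assume a fixed initial state for simplicity
        for _ in range(H):
            a = int(rng.integers(A))
            s_next = sample_next_state(s, a)
            counts[s, a, s_next] += 1
            s = s_next
    return counts

def plan(counts, reward, S, A, H):
    """Planning phase: finite-horizon value iteration on the empirical
    model, for a reward function reward[s, a] revealed only now."""
    n = counts.sum(axis=2, keepdims=True)        # visit counts N(s, a)
    P_hat = np.where(n > 0, counts / np.maximum(n, 1), 1.0 / S)
    V = np.zeros(S)
    policy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = reward + P_hat @ V                   # (S, A) Bellman backup
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy, V

# Usage on a random tabular MDP (hypothetical setup).
rng = np.random.default_rng(0)
S, A, H = 5, 2, 10
P_true = rng.dirichlet(np.ones(S), size=(S, A))  # true transition kernel
sample = lambda s, a: int(rng.choice(S, p=P_true[s, a]))
counts = explore(sample, S, A, H, num_episodes=2000, rng=rng)
policy, V = plan(counts, reward=rng.random((S, A)), S=S, A=A, H=H)
print(V[0])  # estimated optimal value from the initial state
```

The point the sketch illustrates is that the reward enters only in `plan`: a single exploration dataset can be reused to plan for arbitrarily many reward functions.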

Cite

Text

Zhang et al. "Near Optimal Reward-Free Reinforcement Learning." International Conference on Machine Learning, 2021.

Markdown

[Zhang et al. "Near Optimal Reward-Free Reinforcement Learning." International Conference on Machine Learning, 2021.](https://mlanthology.org/icml/2021/zhang2021icml-near/)

BibTeX

@inproceedings{zhang2021icml-near,
  title     = {{Near Optimal Reward-Free Reinforcement Learning}},
  author    = {Zhang, Zihan and Du, Simon and Ji, Xiangyang},
  booktitle = {International Conference on Machine Learning},
  year      = {2021},
  pages     = {12402--12412},
  volume    = {139},
  url       = {https://mlanthology.org/icml/2021/zhang2021icml-near/}
}