The Benefits of Being Distributional: Small-Loss Bounds for Reinforcement Learning

Abstract

While distributional reinforcement learning (DistRL) has been empirically effective, the question of when and why it is better than vanilla, non-distributional RL has remained unanswered. This paper explains the benefits of DistRL through the lens of small-loss bounds, which are instance-dependent bounds that scale with the optimal achievable cost. In particular, our bounds converge much faster than those from non-distributional approaches if the optimal cost is small. As a warmup, we propose a distributional contextual bandit (DistCB) algorithm, which we show enjoys small-loss regret bounds and empirically outperforms the state-of-the-art on three real-world tasks. In online RL, we propose a DistRL algorithm that constructs confidence sets using maximum likelihood estimation. We prove that our algorithm enjoys novel small-loss PAC bounds in low-rank MDPs. As part of our analysis, we introduce the $\ell_1$ distributional eluder dimension, which may be of independent interest. Then, in offline RL, we show that pessimistic DistRL enjoys small-loss PAC bounds that are novel to the offline setting and are more robust to bad single-policy coverage.
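To give a sense of the shape such guarantees take (an illustrative form only, not a statement from the paper), a small-loss (first-order) regret bound replaces the usual worst-case $\sqrt{T}$ dependence with a dependence on the optimal cumulative cost $L^\star$:

$$\mathrm{Regret}(T) \;\le\; \tilde{O}\!\left(\sqrt{d\, L^\star} + d\right),$$

where $d$ is a stand-in for the complexity of the function class and $\tilde{O}$ hides logarithmic factors. When the optimal policy incurs near-zero cost ($L^\star \approx 0$), such a bound is only polylogarithmic in $T$, whereas worst-case bounds still grow as $\sqrt{T}$.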

Cite

Text

Wang et al. "The Benefits of Being Distributional: Small-Loss Bounds for Reinforcement Learning." Neural Information Processing Systems, 2023.

Markdown

[Wang et al. "The Benefits of Being Distributional: Small-Loss Bounds for Reinforcement Learning." Neural Information Processing Systems, 2023.](https://mlanthology.org/neurips/2023/wang2023neurips-benefits/)

BibTeX

@inproceedings{wang2023neurips-benefits,
  title     = {{The Benefits of Being Distributional: Small-Loss Bounds for Reinforcement Learning}},
  author    = {Wang, Kaiwen and Zhou, Kevin and Wu, Runzhe and Kallus, Nathan and Sun, Wen},
  booktitle = {Neural Information Processing Systems},
  year      = {2023},
  url       = {https://mlanthology.org/neurips/2023/wang2023neurips-benefits/}
}