Towards Tractable Optimism in Model-Based Reinforcement Learning

Abstract

The principle of optimism in the face of uncertainty is prevalent throughout sequential decision-making problems such as multi-armed bandits and reinforcement learning (RL). To be successful, an optimistic RL algorithm must over-estimate the true value function (optimism) but not by so much that it is inaccurate (estimation error). In the tabular setting, many state-of-the-art methods produce the required optimism through approaches which are intractable when scaling to deep RL. We re-interpret these scalable optimistic model-based algorithms as solving a tractable noise-augmented MDP. This formulation achieves a competitive regret bound: $\tilde{\mathcal{O}}( |\mathcal{S}|H\sqrt{|\mathcal{A}| T } )$ when augmenting using Gaussian noise, where $T$ is the total number of environment steps. We also explore how this trade-off changes in the deep RL setting, where we show empirically that estimation error is significantly more troublesome. However, we also show that if this error is reduced, optimistic model-based RL algorithms can match state-of-the-art performance in continuous control problems.
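For intuition, the noise-augmented MDP idea can be caricatured in a few lines: perturb the estimated rewards with Gaussian noise whose scale shrinks with visit counts, then plan in the perturbed model as if it were the true one. The sketch below is an illustration under assumptions introduced here (the function name, the count-based noise schedule, and the simplified stationary value iteration are not the paper's exact algorithm or analysis).

```python
import numpy as np

def noise_augmented_value_iteration(P_hat, R_hat, n_visits, H, sigma=1.0, rng=None):
    """Plan in a noise-augmented MDP (illustrative sketch only).

    P_hat    : (S, A, S) array of estimated transition probabilities.
    R_hat    : (S, A) array of estimated mean rewards.
    n_visits : (S, A) array of visit counts; rarely visited pairs get more noise.
    H        : planning horizon.
    """
    rng = np.random.default_rng() if rng is None else rng
    S, A = R_hat.shape

    # Gaussian perturbation whose scale shrinks with visit counts, so the agent
    # is (with reasonable probability) optimistic where data is scarce.
    noise = rng.normal(0.0, sigma / np.sqrt(np.maximum(n_visits, 1)))
    R_tilde = R_hat + noise  # rewards of the noise-augmented MDP

    # Simplified (stationary) value iteration in the augmented model.
    V = np.zeros(S)
    Q = np.zeros((S, A))
    for _ in range(H):
        Q = R_tilde + np.einsum('sat,t->sa', P_hat, V)
        V = Q.max(axis=1)
    return Q.argmax(axis=1), V  # greedy policy and optimistic value estimate
```

The appeal of this style of optimism, per the abstract, is tractability: planning in the augmented MDP is no harder than planning in the estimated MDP itself, which is what allows the construction to carry over to deep RL.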

Cite

Text

Pacchiano et al. "Towards Tractable Optimism in Model-Based Reinforcement Learning." Uncertainty in Artificial Intelligence, 2021.

Markdown

[Pacchiano et al. "Towards Tractable Optimism in Model-Based Reinforcement Learning." Uncertainty in Artificial Intelligence, 2021.](https://mlanthology.org/uai/2021/pacchiano2021uai-tractable/)

BibTeX

@inproceedings{pacchiano2021uai-tractable,
  title     = {{Towards Tractable Optimism in Model-Based Reinforcement Learning}},
  author    = {Pacchiano, Aldo and Ball, Philip and Parker-Holder, Jack and Choromanski, Krzysztof and Roberts, Stephen},
  booktitle = {Uncertainty in Artificial Intelligence},
  year      = {2021},
  pages     = {1413-1423},
  volume    = {161},
  url       = {https://mlanthology.org/uai/2021/pacchiano2021uai-tractable/}
}