Multi-Step Dyna Planning for Policy Evaluation and Control

Abstract

We extend the Dyna planning architecture for policy evaluation and control in two significant ways. First, we introduce multi-step Dyna planning, which projects the simulated state/feature many steps into the future. Our multi-step Dyna is based on a multi-step model, which we call the *$\lambda$-model*. The $\lambda$-model interpolates between the one-step model and an infinite-step model, and can be learned efficiently online. Second, for Dyna control we use a dynamic multi-step model that predicts the results of a sequence of greedy actions and tracks the optimal policy in the long run. Experimental results show that Dyna with the multi-step model evaluates a policy faster than Dyna with single-step models; Dyna control algorithms using the dynamic tracking model are much faster than model-free algorithms; and multi-step Dyna control algorithms allow the policy and value function to converge to their optima much faster than with single-step Dyna algorithms.
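
To make the abstract's idea concrete, below is a minimal, illustrative Python sketch of Dyna-style policy evaluation with a multi-step model under linear function approximation. The model form (a matrix `F` for expected feature dynamics and a vector `f` for expected reward), the closed-form $\lambda$-model construction, the step sizes, and the toy data are all assumptions chosen to be consistent with the standard $\lambda$-return; they are not the paper's exact algorithm or notation.

```python
# Illustrative multi-step Dyna policy evaluation with linear features.
# Assumptions (not from the paper): linear one-step model (F, f) with the
# discount folded into F, a geometric-mixture lambda-model, and an LMS-style
# online model learner.
import numpy as np

rng = np.random.default_rng(0)
n = 4                      # feature dimension (assumed)
gamma, lam = 0.9, 0.8      # discount and lambda (assumed)
alpha_model, alpha_plan = 0.1, 0.1

F = np.zeros((n, n))       # one-step model of gamma * E[phi' | phi]
f = np.zeros(n)            # one-step model of E[r | phi]
theta = np.zeros(n)        # value-function weights

def learn_model(phi, r, phi_next):
    """Online (LMS-style) update of the one-step linear model."""
    global F, f
    F += alpha_model * np.outer(gamma * phi_next - F @ phi, phi)
    f += alpha_model * (r - f @ phi) * phi

def lambda_model():
    """Geometric mixture of k-step models: lam = 0 recovers the one-step
    model, lam -> 1 approaches an infinite-step model (illustrative form)."""
    M = np.linalg.inv(np.eye(n) - lam * F)
    F_lam = (1.0 - lam) * F @ M   # multi-step feature dynamics
    f_lam = M.T @ f               # accumulated multi-step reward
    return F_lam, f_lam

def plan(phi):
    """One Dyna planning step: project phi through the lambda-model and
    apply a TD-like update to the value weights."""
    global theta
    F_lam, f_lam = lambda_model()
    target = f_lam @ phi + theta @ (F_lam @ phi)
    theta += alpha_plan * (target - theta @ phi) * phi

# Toy usage on synthetic transitions: learn the model online, then plan
# from previously seen features (the Dyna idea of replaying from the model).
seen = []
for _ in range(200):
    phi, phi_next, r = rng.random(n), rng.random(n), rng.random()
    learn_model(phi, r, phi_next)
    seen.append(phi)
    plan(seen[rng.integers(len(seen))])
print("value weights:", theta)
```

With `lam = 0` the planning step reduces to ordinary one-step linear Dyna; larger `lam` propagates reward information over many simulated steps per planning update, which is the intuition behind the faster evaluation reported in the abstract.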

Cite

Text

Yao et al. "Multi-Step Dyna Planning for Policy Evaluation and Control." Neural Information Processing Systems, 2009.

Markdown

[Yao et al. "Multi-Step Dyna Planning for Policy Evaluation and Control." Neural Information Processing Systems, 2009.](https://mlanthology.org/neurips/2009/yao2009neurips-multistep/)

BibTeX

@inproceedings{yao2009neurips-multistep,
  title     = {{Multi-Step Dyna Planning for Policy Evaluation and Control}},
  author    = {Yao, Hengshuai and Bhatnagar, Shalabh and Diao, Dongcui and Sutton, Richard S. and Szepesvári, Csaba},
  booktitle = {Neural Information Processing Systems},
  year      = {2009},
  pages     = {2187--2195},
  url       = {https://mlanthology.org/neurips/2009/yao2009neurips-multistep/}
}