Multi-Step Dyna Planning for Policy Evaluation and Control
Abstract
We extend the Dyna planning architecture for policy evaluation and control in two significant ways. First, we introduce multi-step Dyna planning, which projects the simulated state/feature many steps into the future. Our multi-step Dyna is based on a multi-step model, which we call the {\em $\lambda$-model}. The $\lambda$-model interpolates between the one-step model and an infinite-step model, and can be learned efficiently online. Second, for Dyna control we use a dynamic multi-step model that can predict the results of a sequence of greedy actions and track the optimal policy in the long run. Experimental results show that Dyna using the multi-step model evaluates a policy faster than Dyna using single-step models; Dyna control algorithms using the dynamic tracking model are much faster than model-free algorithms; and multi-step Dyna control algorithms allow the policy and value function to converge much faster to their optima than single-step Dyna algorithms.
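The central idea in the abstract, a $\lambda$-model that interpolates between a one-step linear model and longer-horizon models and is then used for Dyna-style planning updates of a value function, can be sketched roughly as follows. This is an illustrative reconstruction under assumptions, not the paper's exact algorithm: the linear model pair (F, b), the truncated geometric mixture in `lambda_model`, and the planning update in `dyna_planning` are hypothetical names and simplifications.

```python
import numpy as np

def learn_one_step_model(transitions, n_features, alpha=0.1):
    """Fit a linear one-step model: phi' ~ F phi and r ~ b . phi,
    from (phi, r, phi_next) samples via stochastic updates."""
    F = np.zeros((n_features, n_features))
    b = np.zeros(n_features)
    for phi, r, phi_next in transitions:
        F += alpha * np.outer(phi_next - F @ phi, phi)
        b += alpha * (r - b @ phi) * phi
    return F, b

def lambda_model(F, b, gamma, lam, k_max=20):
    """One way to realize the interpolation described in the abstract:
    a truncated geometric mixture of k-step discounted models,
    F_lam = (1 - lam) * sum_k lam^(k-1) (gamma F)^k, together with the
    matching accumulated-reward vector b_lam (an assumption, not the
    paper's recursion)."""
    G = gamma * F
    F_lam = np.zeros_like(F)
    b_lam = np.zeros_like(b)
    Gk = np.eye(F.shape[0])      # (gamma F)^(k-1), starts at identity
    b_acc = np.zeros_like(b)     # predictor of reward accumulated over k steps
    for k in range(1, k_max + 1):
        b_acc = b_acc + Gk.T @ b  # add discounted reward at step k
        Gk = Gk @ G               # now (gamma F)^k
        w = (1 - lam) * lam ** (k - 1)
        F_lam += w * Gk
        b_lam += w * b_acc
    return F_lam, b_lam

def dyna_planning(theta, F_lam, b_lam, sampled_features, alpha=0.1):
    """Planning sweep: project each simulated feature with the lambda-model
    and apply a TD-like update to the linear value weights theta."""
    for phi in sampled_features:
        target = b_lam @ phi + theta @ (F_lam @ phi)
        theta += alpha * (target - theta @ phi) * phi
    return theta
```

With $\lambda = 0$ the mixture reduces to the one-step model, and as $\lambda \to 1$ it places more weight on longer projections, matching the interpolation between one-step and infinite-step models described in the abstract.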
Cite
Text
Yao et al. "Multi-Step Dyna Planning for Policy Evaluation and Control." Neural Information Processing Systems, 2009.
Markdown
[Yao et al. "Multi-Step Dyna Planning for Policy Evaluation and Control." Neural Information Processing Systems, 2009.](https://mlanthology.org/neurips/2009/yao2009neurips-multistep/)
BibTeX
@inproceedings{yao2009neurips-multistep,
title = {{Multi-Step Dyna Planning for Policy Evaluation and Control}},
author = {Yao, Hengshuai and Bhatnagar, Shalabh and Diao, Dongcui and Sutton, Richard S. and Szepesvári, Csaba},
booktitle = {Neural Information Processing Systems},
year = {2009},
pages = {2187-2195},
url = {https://mlanthology.org/neurips/2009/yao2009neurips-multistep/}
}