Improving Policies Without Measuring Merits
Abstract
Performing policy iteration in dynamic programming should only require knowledge of relative rather than absolute measures of the utility of actions (Werbos, 1991) - what Baird (1993) calls the advantages of actions at states. Nevertheless, most existing methods in dynamic programming (including Baird's) compute some form of absolute utility function. For smooth problems, advantages satisfy two differential consistency conditions (including the requirement that they be free of curl), and we show that enforcing these can lead to appropriate policy improvement solely in terms of advantages.
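As a quick illustration of the abstract's central point (not a reconstruction of the paper's method), the sketch below shows why greedy policy improvement needs only relative utilities: since the advantage A(s, a) = Q(s, a) - V(s) differs from Q(s, a) by a per-state constant, maximizing advantages over actions picks the same actions as maximizing absolute action values. The function name and toy numbers are illustrative assumptions.

```python
import numpy as np

def greedy_policy_from_advantages(advantages):
    """Pick, at each state, the action with the largest advantage.

    advantages: array of shape (n_states, n_actions) holding
    A(s, a) = Q(s, a) - V(s). Because V(s) is constant across the
    actions available at a state, argmax_a A(s, a) = argmax_a Q(s, a),
    so the improved (greedy) policy never needs absolute utilities.
    """
    return np.argmax(advantages, axis=1)

# Toy example: two states, three actions (hypothetical values).
A = np.array([[ 0.0, -0.3,  0.2],
              [-0.1,  0.0, -0.4]])
print(greedy_policy_from_advantages(A))  # -> [2 1]
```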
Cite
Text
Dayan and Singh. "Improving Policies Without Measuring Merits." Neural Information Processing Systems, 1995.
Markdown
[Dayan and Singh. "Improving Policies Without Measuring Merits." Neural Information Processing Systems, 1995.](https://mlanthology.org/neurips/1995/dayan1995neurips-improving/)
BibTeX
@inproceedings{dayan1995neurips-improving,
title = {{Improving Policies Without Measuring Merits}},
author = {Dayan, Peter and Singh, Satinder P.},
booktitle = {Neural Information Processing Systems},
year = {1995},
pages = {1059-1065},
url = {https://mlanthology.org/neurips/1995/dayan1995neurips-improving/}
}