An Actor/Critic Algorithm That Is Equivalent to Q-Learning
Abstract
We prove the convergence of an actor/critic algorithm that is equivalent to Q-learning by construction. Its equivalence is achieved by encoding Q-values within the policy and value function of the actor and critic. The resultant actor/critic algorithm is novel in two ways: it updates the critic only when the most probable action is executed from any given state, and it rewards the actor using criteria that depend on the relative probability of the action that was executed.
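The encoding described in the abstract can be illustrated with a short tabular sketch. The class name, hyperparameters, and Boltzmann exploration below are illustrative assumptions, not details from the paper; the sketch only shows one way to store Q-values implicitly as a critic value V(s) = max_a Q(s, a) plus actor preferences p(s, a) = Q(s, a) - V(s), with an ordinary Q-learning backup carried out through that encoding.

```python
import numpy as np

# Minimal tabular sketch (not the paper's exact update rules): Q-values are
# held implicitly as a critic value V(s) = max_a Q(s, a) plus actor
# preferences p(s, a) = Q(s, a) - V(s), so the preference is zero for the
# most probable (greedy) action and negative for all others.

class ActorCriticAsQ:
    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.95, temp=1.0):
        self.V = np.zeros(n_states)               # critic: state values
        self.p = np.zeros((n_states, n_actions))  # actor: action preferences
        self.alpha, self.gamma, self.temp = alpha, gamma, temp

    def q(self, s):
        # Implicit Q-values recovered from the critic/actor encoding.
        return self.V[s] + self.p[s]

    def act(self, s, rng):
        # Boltzmann exploration over preferences; the greedy action has
        # preference 0 and is therefore the most probable one.
        logits = self.p[s] / self.temp
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        return int(rng.choice(len(probs), p=probs))

    def update(self, s, a, r, s_next):
        # Standard Q-learning backup applied to the implicit Q-value...
        q_s = self.q(s)
        target = r + self.gamma * self.V[s_next]   # V(s') = max_a Q(s', a)
        q_s[a] += self.alpha * (target - q_s[a])
        # ...then re-encode: the critic keeps the maximum, the actor keeps
        # the differences from that maximum.
        self.V[s] = q_s.max()
        self.p[s] = q_s - self.V[s]
```

Because the greedy action always carries preference 0, the critic value changes exactly when the backup moves the maximizing Q-value, which mirrors the abstract's point that the critic is updated in connection with the most probable action; the paper's own update rules express this condition directly rather than recomputing the maximum.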
Cite
Text
Crites and Barto. "An Actor/Critic Algorithm That Is Equivalent to Q-Learning." Neural Information Processing Systems, 1994.
Markdown
[Crites and Barto. "An Actor/Critic Algorithm That Is Equivalent to Q-Learning." Neural Information Processing Systems, 1994.](https://mlanthology.org/neurips/1994/crites1994neurips-actor/)
BibTeX
@inproceedings{crites1994neurips-actor,
title = {{An Actor/Critic Algorithm That Is Equivalent to Q-Learning}},
author = {Crites, Robert H. and Barto, Andrew G.},
booktitle = {Neural Information Processing Systems},
year = {1994},
pages = {401-408},
url = {https://mlanthology.org/neurips/1994/crites1994neurips-actor/}
}