An Actor/Critic Algorithm That Is Equivalent to Q-Learning

Abstract

We prove the convergence of an actor/critic algorithm that is equivalent to Q-learning by construction. Its equivalence is achieved by encoding Q-values within the policy and value function of the actor and critic. The resultant actor/critic algorithm is novel in two ways: it updates the critic only when the most probable action is executed from any given state, and it rewards the actor using criteria that depend on the relative probability of the action that was executed.
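To make the abstract's description concrete, here is a minimal tabular sketch of how Q-values might be split between a critic's value function and an actor's preference table, with the critic updated only when the most probable action is executed. The update rules, variable names, and hyperparameters below are illustrative assumptions, not a line-by-line reproduction of the paper's construction.

```python
import numpy as np

# Sketch (assumed): Q-values are encoded as Q(s, a) = V(s) + p(s, a), where
# V is the critic's value table and p holds the actor's action preferences.
# The critic is updated only when the most probable (here: highest-preference)
# action is executed; otherwise the actor's preference is adjusted instead.

n_states, n_actions = 5, 3
alpha, gamma = 0.1, 0.99

V = np.zeros(n_states)               # critic: value associated with each state
p = np.zeros((n_states, n_actions))  # actor: preferences relative to the critic

def q_values(s):
    """Reconstruct Q(s, .) from the critic and actor tables."""
    return V[s] + p[s]

def update(s, a, r, s_next):
    """One Q-learning-style TD update, routed to either critic or actor."""
    most_probable = int(np.argmax(p[s]))          # most probable action in s
    target = r + gamma * np.max(q_values(s_next)) # Q-learning target
    delta = target - q_values(s)[a]
    if a == most_probable:
        V[s] += alpha * delta        # critic updated only for the most probable action
    else:
        p[s, a] += alpha * delta     # otherwise the actor's preference absorbs the error
```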

Cite

Text

Crites and Barto. "An Actor/Critic Algorithm That Is Equivalent to Q-Learning." Neural Information Processing Systems, 1994.

Markdown

[Crites and Barto. "An Actor/Critic Algorithm That Is Equivalent to Q-Learning." Neural Information Processing Systems, 1994.](https://mlanthology.org/neurips/1994/crites1994neurips-actor/)

BibTeX

@inproceedings{crites1994neurips-actor,
  title     = {{An Actor/Critic Algorithm That Is Equivalent to Q-Learning}},
  author    = {Crites, Robert H. and Barto, Andrew G.},
  booktitle = {Neural Information Processing Systems},
  year      = {1994},
  pages     = {401-408},
  url       = {https://mlanthology.org/neurips/1994/crites1994neurips-actor/}
}