Differential Eligibility Vectors for Advantage Updating and Gradient Methods

Abstract

In this paper we propose differential eligibility vectors (DEV) for temporal-difference (TD) learning, a new class of eligibility vectors designed to bring out the contribution of each action to the TD-error at each state. Specifically, we use DEV in TD-Q(λ) to more accurately learn the relative value of the actions, rather than their absolute value. We identify conditions that ensure convergence w.p.1 of TD-Q(λ) with DEV and show that this algorithm can also be used to directly approximate the advantage function associated with a given policy, without the need to compute an auxiliary function, something that, to the best of our knowledge, was not previously known to be possible. Finally, we discuss the integration of DEV in LSTDQ and actor-critic algorithms.
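
As a rough illustration of the setting the abstract describes, the sketch below runs TD-Q(λ) with eligibility traces under linear function approximation on a small random MDP, and swaps the standard eligibility feature φ(x, a) for a hypothetical "differential" feature φ(x, a) − Σ_b π(b|x) φ(x, b). This centering is an assumption made purely for illustration; the abstract does not spell out the actual construction of DEV, so the code should be read as a sketch of the general idea of learning relative (advantage-like) action values, not as the paper's algorithm.

# Minimal sketch: TD-Q(lambda) with eligibility traces and a hypothetical
# "differential" eligibility feature
#     phi(x, a) - sum_b pi(b | x) * phi(x, b),
# i.e. the state-action feature centered by its expectation under the
# evaluated policy. The centering is an ASSUMPTION for illustration only.
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 5, 3
gamma, lam, alpha = 0.95, 0.8, 0.05

# Random MDP and a fixed stochastic policy pi to evaluate.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[x, a]: dist over next states
R = rng.normal(size=(n_states, n_actions))                        # expected rewards
pi = rng.dirichlet(np.ones(n_actions), size=n_states)             # pi[x]: dist over actions

def phi(x, a):
    """One-hot (tabular) features over state-action pairs."""
    f = np.zeros(n_states * n_actions)
    f[x * n_actions + a] = 1.0
    return f

def differential_phi(x, a):
    """Hypothetical differential feature: phi(x, a) minus its expectation under pi(.|x)."""
    expected = sum(pi[x, b] * phi(x, b) for b in range(n_actions))
    return phi(x, a) - expected

theta = np.zeros(n_states * n_actions)  # linear weights
z = np.zeros_like(theta)                # eligibility vector

x = rng.integers(n_states)
a = rng.choice(n_actions, p=pi[x])
for t in range(20000):
    r = R[x, a] + 0.1 * rng.normal()
    x_next = rng.choice(n_states, p=P[x, a])
    a_next = rng.choice(n_actions, p=pi[x_next])

    # TD error for on-policy evaluation of pi.
    delta = r + gamma * theta @ phi(x_next, a_next) - theta @ phi(x, a)

    # Eligibility update: the standard choice would accumulate phi(x, a);
    # here we accumulate the (assumed) differential feature instead.
    z = gamma * lam * z + differential_phi(x, a)
    theta += alpha * delta * z

    x, a = x_next, a_next

# Because every increment to theta lies in the span of centered features,
# the learned values have zero pi-weighted mean at each state, i.e. they
# encode relative (advantage-like) values rather than absolute ones.
A_hat = theta.reshape(n_states, n_actions)
print("per-state pi-weighted mean of learned values:",
      np.round((pi * A_hat).sum(axis=1), 6))

The printed per-state means come out numerically zero by construction, which is the point of the illustration: centering the eligibility feature forces the approximation to capture only the relative value of actions at each state.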

Cite

Text

Melo. "Differential Eligibility Vectors for Advantage Updating and Gradient Methods." AAAI Conference on Artificial Intelligence, 2011. doi:10.1609/AAAI.V25I1.7938

Markdown

[Melo. "Differential Eligibility Vectors for Advantage Updating and Gradient Methods." AAAI Conference on Artificial Intelligence, 2011.](https://mlanthology.org/aaai/2011/melo2011aaai-differential/) doi:10.1609/AAAI.V25I1.7938

BibTeX

@inproceedings{melo2011aaai-differential,
  title     = {{Differential Eligibility Vectors for Advantage Updating and Gradient Methods}},
  author    = {Melo, Francisco S.},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2011},
  pages     = {441--446},
  doi       = {10.1609/AAAI.V25I1.7938},
  url       = {https://mlanthology.org/aaai/2011/melo2011aaai-differential/}
}