Learning While Exploring: Bridging the Gaps in the Eligibility Traces

Abstract

The reinforcement learning algorithm TD(λ) applied to Markov decision processes is known to need added exploration in many cases. With the usual implementations of exploration in TD-learning, the feedback signals are either distorted or discarded, so that the exploration hurts the algorithm’s learning. The present article gives a modification of the TD-learning algorithm that allows exploration without cost to the accuracy or speed of learning. The idea is that when the learning agent performs an action it perceives as inferior, it is compensated for its loss, that is, it is given an additional reward equal to its estimated cost of making the exploring move. This modification is compatible with existing exploration strategies, and is seen to work well when applied to a simple grid-world problem, even when always exploring completely at random.
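The compensation idea described in the abstract can be sketched on a toy grid world. The sketch below is an illustrative assumption, not the authors' implementation: the grid layout, rewards, and learning parameters are made up, and the update is a trace-based Q-learning variant in which the compensated reward turns the on-policy target into the greedy target, so the eligibility traces are never cut at exploratory moves.

```python
import random

# Illustrative sketch of reward compensation for exploration (not the
# authors' code). Grid size, rewards, and parameters are assumptions.
ROWS, COLS, GOAL = 4, 4, (3, 3)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
ALPHA, GAMMA, LAM = 0.2, 1.0, 0.8

def step(s, a):
    """Deterministic move; walls keep the agent on the grid; -1 per step."""
    r = min(max(s[0] + ACTIONS[a][0], 0), ROWS - 1)
    c = min(max(s[1] + ACTIONS[a][1], 0), COLS - 1)
    ns = (r, c)
    return ns, -1.0, ns == GOAL

Q = {((r, c), a): 0.0
     for r in range(ROWS) for c in range(COLS) for a in range(4)}

random.seed(0)
for _ in range(5000):
    s, a = (0, 0), random.randrange(4)   # explore completely at random
    traces, done = {}, False
    while not done:
        ns, reward, done = step(s, a)
        na = random.randrange(4)         # next action, again random
        if not done:
            # Compensation: pay the agent its estimated loss from exploring,
            # so the trace need not be cut at the non-greedy move.
            reward += GAMMA * (max(Q[(ns, b)] for b in range(4)) - Q[(ns, na)])
        delta = reward + (0.0 if done else GAMMA * Q[(ns, na)]) - Q[(s, a)]
        traces[(s, a)] = traces.get((s, a), 0.0) + 1.0
        for k in list(traces):           # propagate the error down the trace
            Q[k] += ALPHA * delta * traces[k]
            traces[k] *= GAMMA * LAM
        s, a = ns, na
```

After training, acting greedily with respect to `Q` from the start state should follow a shortest path to the goal, even though every training action was chosen uniformly at random.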

Cite

Text

Dahl and Halck. "Learning While Exploring: Bridging the Gaps in the Eligibility Traces." European Conference on Machine Learning, 2001. doi:10.1007/3-540-44795-4_7

Markdown

[Dahl and Halck. "Learning While Exploring: Bridging the Gaps in the Eligibility Traces." European Conference on Machine Learning, 2001.](https://mlanthology.org/ecmlpkdd/2001/dahl2001ecml-learning/) doi:10.1007/3-540-44795-4_7

BibTeX

@inproceedings{dahl2001ecml-learning,
  title     = {{Learning While Exploring: Bridging the Gaps in the Eligibility Traces}},
  author    = {Dahl, Fredrik A. and Halck, Ole Martin},
  booktitle = {European Conference on Machine Learning},
  year      = {2001},
  pages     = {73--84},
  doi       = {10.1007/3-540-44795-4_7},
  url       = {https://mlanthology.org/ecmlpkdd/2001/dahl2001ecml-learning/}
}