Timing and Partial Observability in the Dopamine System
Abstract
According to a series of influential models, dopamine (DA) neurons signal reward prediction error using a temporal-difference (TD) algorithm. We address a problem not convincingly solved in these accounts: how to maintain a representation of cues that predict delayed consequences. Our new model uses a TD rule grounded in partially observable semi-Markov processes, a formalism that captures two largely neglected features of DA experiments: hidden state and temporal variability. Previous models predicted rewards using a tapped delay line representation of sensory inputs; we replace this with a more active process of inference about the underlying state of the world. The DA system can then learn to map these inferred states to reward predictions using TD. The new model can explain previously vexing data on the responses of DA neurons in the face of temporal variability. By combining statistical model-based learning with a physiologically grounded TD theory, it also brings into contact with physiology some insights about behavior that had previously been confined to more abstract psychological models.
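To illustrate the core idea described in the abstract, the sketch below shows TD(0) learning of reward predictions computed over an inferred belief state rather than a tapped delay line. This is a minimal illustration, not the authors' implementation: the number of hidden states, the transition and observation matrices, and the learning parameters are all illustrative assumptions, and the belief update here is a plain Bayesian filter rather than the paper's semi-Markov formulation.

```python
# Minimal sketch (illustrative assumptions, not the paper's model):
# TD(0) learning of reward predictions over an inferred belief state.
import numpy as np

n_states = 3                      # hidden states of the world (assumed)
w = np.zeros(n_states)            # value weights: V(b) = w . b
alpha, gamma = 0.1, 0.95          # learning rate and discount (assumed)

# Hypothetical generative model used only for belief inference.
T = np.array([[0.9, 0.1, 0.0],    # T[i, j] = P(next state j | current state i)
              [0.0, 0.8, 0.2],
              [0.1, 0.0, 0.9]])
O = np.array([[0.8, 0.1, 0.1],    # O[s, o] = P(observation o | state s)
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])

def belief_update(b, obs):
    """Bayesian filter: propagate the belief through the transition model,
    then reweight by the likelihood of the new observation."""
    b_pred = T.T @ b
    b_new = O[:, obs] * b_pred
    return b_new / b_new.sum()

def td_step(b, b_next, reward, w):
    """TD(0) update of the reward prediction, computed on belief states.
    The prediction error delta plays the role of the DA-like signal."""
    v, v_next = w @ b, w @ b_next
    delta = reward + gamma * v_next - v
    return w + alpha * delta * b, delta

# Example usage: one observation-reward step starting from a uniform belief.
b = np.ones(n_states) / n_states
b_next = belief_update(b, obs=1)
w, delta = td_step(b, b_next, reward=1.0, w=w)
```

The design choice this sketch highlights is the one the abstract emphasizes: the value function is linear in the belief state, so learning maps inferred hidden states (not raw sensory history) onto reward predictions.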
Cite
Text
Daw et al. "Timing and Partial Observability in the Dopamine System." Neural Information Processing Systems, 2002.
Markdown
[Daw et al. "Timing and Partial Observability in the Dopamine System." Neural Information Processing Systems, 2002.](https://mlanthology.org/neurips/2002/daw2002neurips-timing/)
BibTeX
@inproceedings{daw2002neurips-timing,
title = {{Timing and Partial Observability in the Dopamine System}},
author = {Daw, Nathaniel D. and Courville, Aaron C. and Touretzky, David S.},
booktitle = {Neural Information Processing Systems},
year = {2002},
pages = {99-106},
url = {https://mlanthology.org/neurips/2002/daw2002neurips-timing/}
}