A Variational Formulation of Reinforcement Learning in Infinite-Horizon Markov Decision Processes

Abstract

Reinforcement learning in infinite-horizon Markov decision processes (MDPs) is typically framed as expected discounted return maximization. In this paper, we formulate an alternative principle for optimal sequential decision-making in infinite-horizon MDPs: variational Bayesian inference in transdimensional probabilistic models. In particular, we specify a probabilistic model over random-length state-action trajectories and consider the variational problem of finding an approximation to the posterior distribution over random-length state-action trajectories conditioned on state-action trajectories that reflect some desired behavior. We derive a tractable variational objective for infinite-horizon settings, prove a variational dynamic-discount policy iteration theorem, show that fixed-discount-factor KL-regularized reinforcement learning objectives are special cases of dynamic-discount variational objectives, and prove that learning dynamic discount factors is optimal.
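
As an illustrative reading of the final claim, the sketch below writes down a standard fixed-discount KL-regularized objective and an assumed dynamic-discount counterpart; the notation (discount sequence \(\gamma_{0:\infty}\), temperature \(\alpha\), reference policy \(\pi_0\)) is hypothetical and not taken from the paper. The familiar fixed-discount KL-regularized objective is

\[
\mathcal{J}_{\gamma}(\pi) \;=\; \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} \Big( r(s_t, a_t) \;-\; \alpha\, \mathrm{KL}\big(\pi(\cdot \mid s_t)\,\|\,\pi_0(\cdot \mid s_t)\big) \Big) \right],
\]

and one way a dynamic-discount objective could generalize it is by weighting each term with a product of per-step discounts,

\[
\mathcal{J}_{\gamma_{0:\infty}}(\pi) \;=\; \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \Big( \prod_{k=0}^{t-1} \gamma_k \Big) \Big( r(s_t, a_t) \;-\; \alpha\, \mathrm{KL}\big(\pi(\cdot \mid s_t)\,\|\,\pi_0(\cdot \mid s_t)\big) \Big) \right],
\]

which reduces to the fixed-discount objective when \(\gamma_k = \gamma\) for all \(k\), since \(\prod_{k=0}^{t-1} \gamma = \gamma^{t}\). The paper's actual variational objective and its dynamic-discount construction are given in the full text.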

Cite

Text

Rudner. "A Variational Formulation of Reinforcement Learning in Infinite-Horizon Markov Decision Processes." ICML 2024 Workshops: RLControlTheory, 2024.

Markdown

[Rudner. "A Variational Formulation of Reinforcement Learning in Infinite-Horizon Markov Decision Processes." ICML 2024 Workshops: RLControlTheory, 2024.](https://mlanthology.org/icmlw/2024/rudner2024icmlw-variational/)

BibTeX

@inproceedings{rudner2024icmlw-variational,
  title     = {{A Variational Formulation of Reinforcement Learning in Infinite-Horizon Markov Decision Processes}},
  author    = {Rudner, Tim G. J.},
  booktitle = {ICML 2024 Workshops: RLControlTheory},
  year      = {2024},
  url       = {https://mlanthology.org/icmlw/2024/rudner2024icmlw-variational/}
}