Markov Decision Processes in Large State Spaces

Abstract

In this paper we propose a new framework for studying Markov decision processes (MDPs), based on ideas from statistical mechanics. The goal of learning in MDPs is to find a policy that yields the maximum expected return over time. In choosing policies, agents must therefore weigh the prospects of short-term versus long-term gains. We study a simple MDP in which the agent must constantly decide between exploratory jumps and local reward mining in state space. The number of policies to choose from grows exponentially with the size of the state space, N . We view the expected returns as defining an energy landscape over policy space. Methods from statistical mechanics are used to analyze this landscape in the thermodynamic limit N ! 1. We calculate the overall distribution of expected returns, as well as the distribution of returns for policies at a fixed Hamming distance from the optimal one. We briefly discuss the problem of learning optimal policies from empirical estimates of the expe...

Cite

Text

Saul and Singh. "Markov Decision Processes in Large State Spaces." Annual Conference on Computational Learning Theory, 1995. doi:10.1145/225298.225332

Markdown

[Saul and Singh. "Markov Decision Processes in Large State Spaces." Annual Conference on Computational Learning Theory, 1995.](https://mlanthology.org/colt/1995/saul1995colt-markov/) doi:10.1145/225298.225332

BibTeX

@inproceedings{saul1995colt-markov,
  title     = {{Markov Decision Processes in Large State Spaces}},
  author    = {Saul, Lawrence K. and Singh, Satinder P.},
  booktitle = {Annual Conference on Computational Learning Theory},
  year      = {1995},
  pages     = {281-288},
  doi       = {10.1145/225298.225332},
  url       = {https://mlanthology.org/colt/1995/saul1995colt-markov/}
}