Markov Decision Processes in Large State Spaces
Abstract
In this paper we propose a new framework for studying Markov decision processes (MDPs), based on ideas from statistical mechanics. The goal of learning in MDPs is to find a policy that yields the maximum expected return over time. In choosing policies, agents must therefore weigh the prospects of short-term versus long-term gains. We study a simple MDP in which the agent must constantly decide between exploratory jumps and local reward mining in state space. The number of policies to choose from grows exponentially with the size of the state space, N . We view the expected returns as defining an energy landscape over policy space. Methods from statistical mechanics are used to analyze this landscape in the thermodynamic limit N ! 1. We calculate the overall distribution of expected returns, as well as the distribution of returns for policies at a fixed Hamming distance from the optimal one. We briefly discuss the problem of learning optimal policies from empirical estimates of the expe...
Cite
Text
Saul and Singh. "Markov Decision Processes in Large State Spaces." Annual Conference on Computational Learning Theory, 1995. doi:10.1145/225298.225332Markdown
[Saul and Singh. "Markov Decision Processes in Large State Spaces." Annual Conference on Computational Learning Theory, 1995.](https://mlanthology.org/colt/1995/saul1995colt-markov/) doi:10.1145/225298.225332BibTeX
@inproceedings{saul1995colt-markov,
title = {{Markov Decision Processes in Large State Spaces}},
author = {Saul, Lawrence K. and Singh, Satinder P.},
booktitle = {Annual Conference on Computational Learning Theory},
year = {1995},
pages = {281-288},
doi = {10.1145/225298.225332},
url = {https://mlanthology.org/colt/1995/saul1995colt-markov/}
}