Near-Optimal Sample Complexity for MDPs via Anchoring
Abstract
We study a new model-free algorithm to compute $\varepsilon$-optimal policies for average reward Markov decision processes, in the weakly communicating setting. Given a generative model, our procedure combines a recursive sampling technique with Halpern's anchored iteration, and computes an $\varepsilon$-optimal policy with sample and time complexity $\widetilde{O}(|\mathcal{S}||\mathcal{A}|\|h\|^{2}/\varepsilon^{2})$ both in high probability and in expectation. To our knowledge, this is the best complexity among model-free algorithms, matching the known lower bound up to a factor $\|h\|$. Although the complexity bound involves the span seminorm $\|h\|$ of the unknown bias vector, the algorithm requires no prior knowledge and implements a stopping rule which guarantees with probability 1 that the procedure terminates in finite time. We also analyze how these techniques can be adapted for discounted MDPs.
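To make the anchoring mechanism concrete, below is a minimal sketch (not the paper's model-free, sample-based procedure) of Halpern's anchored iteration applied to exact value iteration for a discounted MDP with a known model; the arrays P, r and the parameters gamma, num_iters are hypothetical placeholders. The anchored update is $V_{k+1} = \beta_k V_0 + (1-\beta_k) T(V_k)$ with the standard Halpern weights $\beta_k = 1/(k+2)$.

# Minimal illustrative sketch (not the paper's algorithm): Halpern's anchored
# iteration applied to value iteration for a discounted MDP with a *known*
# model. The names P, r, gamma, num_iters are hypothetical placeholders.
import numpy as np

def bellman_operator(V, P, r, gamma):
    # Q[s, a] = r[s, a] + gamma * sum_{s'} P[s, a, s'] * V[s']
    Q = r + gamma * (P @ V)
    return Q.max(axis=1)

def anchored_value_iteration(P, r, gamma, num_iters=1000):
    V0 = np.zeros(r.shape[0])        # the anchor point
    V = V0.copy()
    for k in range(num_iters):
        beta = 1.0 / (k + 2)         # Halpern anchoring weights
        # Anchored update: pull the Bellman iterate back toward the anchor V0
        V = beta * V0 + (1.0 - beta) * bellman_operator(V, P, r, gamma)
    return V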
Cite
Text
Lee et al. "Near-Optimal Sample Complexity for MDPs via Anchoring." Proceedings of the 42nd International Conference on Machine Learning, 2025.
Markdown
[Lee et al. "Near-Optimal Sample Complexity for MDPs via Anchoring." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/lee2025icml-nearoptimal/)
BibTeX
@inproceedings{lee2025icml-nearoptimal,
title = {{Near-Optimal Sample Complexity for MDPs via Anchoring}},
author = {Lee, Jongmin and Bravo, Mario and Cominetti, Roberto},
booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
year = {2025},
pages = {32907--32929},
volume = {267},
url = {https://mlanthology.org/icml/2025/lee2025icml-nearoptimal/}
}