Near-Optimal Sample Complexity for MDPs via Anchoring

Abstract

We study a new model-free algorithm for computing $\varepsilon$-optimal policies for average-reward Markov decision processes in the weakly communicating setting. Given a generative model, our procedure combines a recursive sampling technique with Halpern's anchored iteration, and computes an $\varepsilon$-optimal policy with sample and time complexity $\widetilde{O}(|\mathcal{S}||\mathcal{A}|\|h\|^{2}/\varepsilon^{2})$, both in high probability and in expectation. To our knowledge, this is the best complexity among model-free algorithms, matching the known lower bound up to a factor of $\|h\|$. Although the complexity bound involves the span seminorm $\|h\|$ of the unknown bias vector, the algorithm requires no prior knowledge and implements a stopping rule that guarantees, with probability 1, that the procedure terminates in finite time. We also analyze how these techniques can be adapted to discounted MDPs.
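For context, Halpern's anchored iteration for a nonexpansive map keeps a fixed anchor point and, at each step, averages it with the operator update. A minimal sketch in generic notation (the symbols $T$, $x_k$, $\beta_k$ are illustrative and not taken from the paper) is

$$x_{k+1} = \beta_{k+1}\, x_0 + (1-\beta_{k+1})\, T(x_k), \qquad \beta_{k+1} \in (0,1),$$

with an anchoring weight sequence such as $\beta_{k+1} = 1/(k+2)$. In the setting of the abstract, the exact evaluation of the Bellman-type operator would be replaced by estimates obtained through the recursive sampling technique under the generative model; the precise update and the choice of anchoring weights are specified in the paper.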

Cite

Text

Lee et al. "Near-Optimal Sample Complexity for MDPs via Anchoring." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown

[Lee et al. "Near-Optimal Sample Complexity for MDPs via Anchoring." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/lee2025icml-nearoptimal/)

BibTeX

@inproceedings{lee2025icml-nearoptimal,
  title     = {{Near-Optimal Sample Complexity for MDPs via Anchoring}},
  author    = {Lee, Jongmin and Bravo, Mario and Cominetti, Roberto},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
  pages     = {32907--32929},
  volume    = {267},
  url       = {https://mlanthology.org/icml/2025/lee2025icml-nearoptimal/}
}