On Constant Regret for Low-Rank MDPs

Abstract

Although instance-dependent regret bounds exist for linear Markov decision processes (MDPs) and low-rank bandits, extensions to low-rank MDPs remain unexplored. In this work, we close this gap and provide regret bounds for low-rank MDPs in an instance-dependent setting. Specifically, we introduce an algorithm, called UniSREP-UCB, which utilizes a constrained optimization objective to learn features with good spectral properties. Furthermore, we demonstrate that our algorithm enjoys constant regret if the minimal sub-optimality gap and the occupancy distribution of the optimal policy are well-defined and known. To the best of our knowledge, these are the first instance-dependent regret results for low-rank MDPs.

Cite

Text

Sturm and Tschiatschek. "On Constant Regret for Low-Rank MDPs." Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence, 2025.

Markdown

[Sturm and Tschiatschek. "On Constant Regret for Low-Rank MDPs." Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence, 2025.](https://mlanthology.org/uai/2025/sturm2025uai-constant/)

BibTeX

@inproceedings{sturm2025uai-constant,
  title     = {{On Constant Regret for Low-Rank MDPs}},
  author    = {Sturm, Alexander and Tschiatschek, Sebastian},
  booktitle = {Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence},
  year      = {2025},
  pages     = {4044--4079},
  volume    = {286},
  url       = {https://mlanthology.org/uai/2025/sturm2025uai-constant/}
}