On Constant Regret for Low-Rank MDPs
Abstract
Although there exist instance-dependent regret bounds for linear Markov decision processes (MDPs) and low-rank bandits, extensions to low-rank MDPs remain unexplored. In this work, we close this gap and provide regret bounds for low-rank MDPs in an instance-dependent setting. Specifically, we introduce an algorithm, called UniSREP-UCB, which utilizes a constrained optimization objective to learn features with good spectral properties. Furthermore, we demonstrate that our algorithm enjoys constant regret if the minimal sub-optimality gap and the occupancy distribution of the optimal policy are well-defined and known. To the best of our knowledge, these are the first instance-dependent regret results for low-rank MDPs.
Cite
Text
Sturm and Tschiatschek. "On Constant Regret for Low-Rank MDPs." Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence, 2025.
Markdown
[Sturm and Tschiatschek. "On Constant Regret for Low-Rank MDPs." Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence, 2025.](https://mlanthology.org/uai/2025/sturm2025uai-constant/)
BibTeX
@inproceedings{sturm2025uai-constant,
  title = {{On Constant Regret for Low-Rank MDPs}},
  author = {Sturm, Alexander and Tschiatschek, Sebastian},
  booktitle = {Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence},
  year = {2025},
  pages = {4044--4079},
  volume = {286},
  url = {https://mlanthology.org/uai/2025/sturm2025uai-constant/}
}