Sharp Gap-Dependent Variance-Aware Regret Bounds for Tabular MDPs
Abstract
We consider gap-dependent regret bounds for episodic MDPs. We show that the Monotonic Value Propagation (MVP) algorithm (Zhang et al. [2024]) achieves a variance-aware gap-dependent regret bound of $\tilde{O}\left(\left(\sum_{\Delta_h(s,a)>0} \frac{H^2 \log K \land \mathtt{Var}\_{\max}^{\textup{c}}}{\Delta_h(s,a)} +\sum_{\Delta_h(s,a)=0}\frac{ H^2 \land \mathtt{Var}\_{\max}^{\textup{c}}}{\Delta_{\mathrm{min}}} + SAH^4 (S \lor H) \right) \log K\right),$ where $H$ is the planning horizon, $S$ is the number of states, $A$ is the number of actions, $K$ is the number of episodes, and $\tilde{O}$ hides $\mathsf{poly} \log (S, A, H, 1 / \Delta\_{\mathrm{min}}, 1 / \delta)$ terms. Here, $\Delta_h(s,a) =V_h^* (a) - Q_h^* (s, a)$ represents the suboptimality gap and $\Delta_{\mathrm{min}} := \min_{\Delta_h (s,a) > 0} \Delta_h(s,a)$. The term $\mathtt{Var}\_{\max}^{\textup{c}}$ denotes the maximum conditional total variance, calculated as the maximum over all $(\pi, h, s)$ tuples of the expected total variance under policy $\pi$ conditioned on trajectories visiting state $s$ at step $h$. $\mathtt{Var}\_{\max}^{\textup{c}}$ characterizes the maximum randomness encountered when learning any $(h, s)$ pair. Our result stems from a novel analysis of the weighted sum of the suboptimality gap and can be potentially adapted for other algorithms. To complement the study, we establish a lower bound of $\Omega \left( \sum_{\Delta_h(s,a)>0} \frac{H^2 \land \mathtt{Var}\_{\max}^{\textup{c}}}{\Delta_h(s,a)}\cdot \log K\right),$ demonstrating the necessity of dependence on $\mathtt{Var}\_{\max}^{\textup{c}}$ even when the maximum unconditional total variance (without conditioning on $(h, s)$) approaches zero.
Cite
Text
Chen et al. "Sharp Gap-Dependent Variance-Aware Regret Bounds for Tabular MDPs." Advances in Neural Information Processing Systems, 2025.Markdown
[Chen et al. "Sharp Gap-Dependent Variance-Aware Regret Bounds for Tabular MDPs." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/chen2025neurips-sharp/)BibTeX
@inproceedings{chen2025neurips-sharp,
title = {{Sharp Gap-Dependent Variance-Aware Regret Bounds for Tabular MDPs}},
author = {Chen, Shulun and Zhou, Runlong and Zhang, Zihan and Fazel, Maryam and Du, Simon Shaolei},
booktitle = {Advances in Neural Information Processing Systems},
year = {2025},
url = {https://mlanthology.org/neurips/2025/chen2025neurips-sharp/}
}