Optimal Regret Bounds via Low-Rank Structured Variation in Non-Stationary Reinforcement Learning
Abstract
We study reinforcement learning in non-stationary communicating MDPs whose transition drift admits a low-rank-plus-sparse structure. We propose \textbf{SVUCRL} (Structured Variation UCRL) and prove the dynamic-regret bound \[\scalebox{0.75}{$\displaystyle \widetilde{\mathcal O}\Bigl( \sqrt{SA\,T} + D_{\max}S\sqrt{A\,T} + L_{\max}B_r + D_{\max}L_{\max}B_p + D_{\max}S\sqrt{A\,B_p} + D_{\max}\delta_B B_p + D_{\max}\sqrt{K\,T} \Bigr), $}\] up to an additional planning-tolerance term $\sum_{t=1}^T \varepsilon_{\tau(m(t))}$, where $S$ is the number of states, $A$ the number of actions, $T$ the horizon, $D_{\max}$ the MDP diameter, $L_{\max}$ the maximum episode length, $B_r$ and $B_p$ the total reward and transition variation budgets, and $K \ll SA$ the rank of the structured drift. The first two terms are the statistical price of learning in stationary problems. The structure-dependent non-stationarity contribution enters through $D_{\max}\sqrt{K\,T}$ (low-rank drift) and $D_{\max}\delta_B B_p$ (sparse shocks), which scale with $\sqrt{K}$ rather than $\sqrt{SA}$ when the drift is low-rank. This matches the $\sqrt{T}$ rate (up to logarithmic factors) and improves on prior $T^{3/4}$-type guarantees. SVUCRL combines: (i) online low-rank tracking with explicit Frobenius-norm guarantees, (ii) incremental RPCA to separate structured drift from sparse shocks, (iii) adaptive confidence widening via a bias-corrected local-variation estimator, and (iv) factor forecasting with an optimal shrinkage center.
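The abstract's component (ii) uses RPCA to split observed transition changes into a low-rank drift part and a sparse shock part. As a hedged illustration only, the sketch below solves the classic batch principal-component-pursuit formulation with a simple ALM/ADMM-style loop; the paper's algorithm is incremental, and the function names `rpca`, `svt`, and `soft_threshold` are placeholders of ours, not the paper's API.

```python
import numpy as np

def svt(X, tau):
    """Singular value thresholding: shrink singular values of X by tau."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def soft_threshold(X, lam):
    """Elementwise soft-thresholding; promotes entrywise sparsity."""
    return np.sign(X) * np.maximum(np.abs(X) - lam, 0.0)

def rpca(M, lam=None, mu=None, iters=500, tol=1e-7):
    """Decompose M ~= L + S with L low-rank (drift) and S sparse (shocks),
    via inexact augmented-Lagrangian iterations for principal component
    pursuit. This is a batch sketch, not the incremental variant."""
    m, n = M.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))          # standard PCP weight
    if mu is None:
        mu = 0.25 * m * n / (np.abs(M).sum() + 1e-12)
    S = np.zeros_like(M)
    Y = np.zeros_like(M)                         # dual variable
    for _ in range(iters):
        L = svt(M - S + Y / mu, 1.0 / mu)        # low-rank update
        S = soft_threshold(M - L + Y / mu, lam / mu)  # sparse update
        resid = M - L - S
        Y = Y + mu * resid                       # dual ascent step
        if np.linalg.norm(resid) <= tol * np.linalg.norm(M):
            break
    return L, S
```

On a matrix of stacked per-period transition differences, `L` would capture the rank-$K$ structured drift and `S` the sparse shocks, mirroring the decomposition the regret terms $D_{\max}\sqrt{K\,T}$ and $D_{\max}\delta_B B_p$ account for separately.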
Cite
Text
Dam. "Optimal Regret Bounds via Low-Rank Structured Variation in Non-Stationary Reinforcement Learning." Advances in Neural Information Processing Systems, 2025.
Markdown
[Dam. "Optimal Regret Bounds via Low-Rank Structured Variation in Non-Stationary Reinforcement Learning." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/dam2025neurips-optimal/)
BibTeX
@inproceedings{dam2025neurips-optimal,
title = {{Optimal Regret Bounds via Low-Rank Structured Variation in Non-Stationary Reinforcement Learning}},
author = {Dam, Tuan Quang},
booktitle = {Advances in Neural Information Processing Systems},
year = {2025},
url = {https://mlanthology.org/neurips/2025/dam2025neurips-optimal/}
}