A Duality Approach for Regret Minimization in Average-Award Ergodic Markov Decision Processes
Abstract
In light of Bellman duality, we propose a novel value-policy gradient algorithm to explore and act in infinite-horizon average-reward Markov decision processes (AMDPs) and show that it achieves sublinear regret. The algorithm is motivated by the Bellman saddle point formulation. It learns the optimal state-action distribution, which encodes a randomized policy, by interacting with the environment along a single trajectory and making primal-dual updates. The key to the analysis is to establish a connection between the min-max duality gap of the Bellman saddle point and the cumulative regret of the learning agent. We show that, for ergodic AMDPs with finite state space $\mathcal{S}$ and action space $\mathcal{A}$ and uniformly bounded mixing times, the algorithm's $T$-time-step regret is $ R(T)=\tilde{\mathcal{O}}\left( \left(t_{\mathrm{mix}}^*\right)^2 \tau^{\frac{3}{2}} \sqrt{(\tau^3 + |\mathcal{A}|) |\mathcal{S}| T} \right), $ where $t_{\mathrm{mix}}^*$ is the worst-case mixing time, $\tau$ is an ergodicity parameter, $T$ is the number of time steps, and $\tilde{\mathcal{O}}$ hides polylog factors.
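For context, a minimal sketch of the Bellman saddle-point (linear-programming) view of AMDPs that this line of primal-dual methods builds on, in notation chosen here for illustration and not necessarily matching the paper's exact formulation: with reward $r(s,a)$ and transition kernel $P(s'\mid s,a)$, one considers
$$ \min_{h \in \mathbb{R}^{|\mathcal{S}|}} \; \max_{\mu \in \Delta_{\mathcal{S}\times\mathcal{A}}} \; \sum_{s,a} \mu(s,a)\Big( r(s,a) + \sum_{s'} P(s'\mid s,a)\, h(s') - h(s) \Big), $$
where $h$ is a differential value function (primal variable) and the dual variable $\mu$ is a state-action distribution. For ergodic AMDPs the saddle value equals the optimal average reward, and $\mu^*$ at the saddle is the stationary state-action distribution of an optimal policy, so a randomized policy can be recovered as $\pi(a\mid s) = \mu(s,a) / \sum_{a'} \mu(s,a')$. A primal-dual algorithm of the kind described in the abstract performs stochastic gradient-type updates on $h$ and $\mu$ from samples observed along a single trajectory.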
Cite
Text
Gong and Wang. "A Duality Approach for Regret Minimization in Average-Award Ergodic Markov Decision Processes." Proceedings of the 2nd Conference on Learning for Dynamics and Control, 2020.
Markdown
[Gong and Wang. "A Duality Approach for Regret Minimization in Average-Award Ergodic Markov Decision Processes." Proceedings of the 2nd Conference on Learning for Dynamics and Control, 2020.](https://mlanthology.org/l4dc/2020/gong2020l4dc-duality/)
BibTeX
@inproceedings{gong2020l4dc-duality,
title = {{A Duality Approach for Regret Minimization in Average-Award Ergodic Markov Decision Processes}},
author = {Gong, Hao and Wang, Mengdi},
booktitle = {Proceedings of the 2nd Conference on Learning for Dynamics and Control},
year = {2020},
pages = {862-883},
volume = {120},
url = {https://mlanthology.org/l4dc/2020/gong2020l4dc-duality/}
}