Order-Optimal Regret with Novel Policy Gradient Approaches in Infinite-Horizon Average Reward MDPs

Abstract

We present two Policy Gradient-based algorithms with general parametrization in the context of infinite-horizon average reward Markov Decision Processes (MDPs). The first employs Implicit Gradient Transport for variance reduction, ensuring an expected regret of the order $\tilde{\mathcal{O}}(T^{2/3})$. The second, rooted in Hessian-based techniques, ensures an expected regret of the order $\tilde{\mathcal{O}}(\sqrt{T})$. These results significantly improve the state-of-the-art $\tilde{\mathcal{O}}(T^{3/4})$ regret, and the $\tilde{\mathcal{O}}(\sqrt{T})$ bound matches the theoretical lower bound up to logarithmic factors. We also show that the average-reward function is approximately $L$-smooth, a property that earlier works had assumed without proof.
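For context on the variance-reduction device named above, the generic implicit gradient transport (IGT) update from the stochastic optimization literature averages past gradients while evaluating each new stochastic gradient at an extrapolated point. The display below is a minimal sketch of this standard update applied to policy parameters $\theta_t$, not the paper's exact estimator; the step size $\eta$, averaging weight $\gamma_t$, and stochastic gradient estimate $\hat{\nabla} J$ of the average reward are assumed notation:

$$
g_t \;=\; (1-\gamma_t)\,\hat{\nabla} J\!\left(\theta_t + \tfrac{\gamma_t}{1-\gamma_t}\,(\theta_t - \theta_{t-1})\right) \;+\; \gamma_t\, g_{t-1},
\qquad
\theta_{t+1} \;=\; \theta_t + \eta\, g_t,
$$

with, e.g., $\gamma_t = t/(t+1)$, so that $g_t$ acts as an average of past gradients "transported" to the current iterate, reducing the variance of the policy-gradient estimate without storing past trajectories.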

Cite

Text

Ganesh et al. "Order-Optimal Regret with Novel Policy Gradient Approaches in Infinite-Horizon Average Reward MDPs." Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, 2025.

Markdown

[Ganesh et al. "Order-Optimal Regret with Novel Policy Gradient Approaches in Infinite-Horizon Average Reward MDPs." Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, 2025.](https://mlanthology.org/aistats/2025/ganesh2025aistats-orderoptimal/)

BibTeX

@inproceedings{ganesh2025aistats-orderoptimal,
  title     = {{Order-Optimal Regret with Novel Policy Gradient Approaches in Infinite-Horizon Average Reward MDPs}},
  author    = {Ganesh, Swetha and Mondal, Washim Uddin and Aggarwal, Vaneet},
  booktitle = {Proceedings of The 28th International Conference on Artificial Intelligence and Statistics},
  year      = {2025},
  pages     = {3421--3429},
  volume    = {258},
  url       = {https://mlanthology.org/aistats/2025/ganesh2025aistats-orderoptimal/}
}