Fast Convergence of SoftMax Policy Mirror Ascent for Bandits & Tabular MDPs

Abstract

We analyze the convergence of a novel policy gradient algorithm (referred to as SPMA) for multi-armed bandits and tabular Markov decision processes (MDPs). SPMA is an instantiation of mirror ascent and uses the softmax parameterization with a log-sum-exp mirror map. Given access to the exact policy gradients, we prove that SPMA with a constant step-size requires $O(\log(1/\epsilon))$ iterations to converge to an $\epsilon$-optimal policy. The resulting convergence rate is better than both the $\Theta\left(\frac{1}{\epsilon}\right)$ rate for constant step-size softmax policy gradient (SPG) and the $O\left(\frac{1}{\sqrt{\epsilon}}\right)$ rate for SPG with Nesterov acceleration. Furthermore, unlike the SPG results, the convergence rate for SPMA does not depend on potentially large problem-dependent constants, and matches the rate achieved by natural policy gradient (NPG). Additionally, for multi-armed bandits, we prove that SPMA with gap-dependent step-sizes can achieve global super-linear convergence. Our experimental evaluations on tabular MDPs and continuous control tasks demonstrate that SPMA consistently outperforms SPG while achieving similar or better performance compared to NPG.
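To make the setup concrete, below is a minimal, hypothetical sketch (not the authors' code) of mirror ascent on the softmax logits with a log-sum-exp mirror map for a $K$-armed bandit with known mean rewards. Since the mirror map $\Phi(z) = \log\sum_a \exp(z_a)$ has gradient $\mathrm{softmax}(z)$, the mirror-ascent step can be carried out directly on the policy probabilities; the function and parameter names (`spma_step`, `eta`) are illustrative and the step-size choice is only an assumption.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def spma_step(z, r, eta):
    """One mirror-ascent step on logits z for the expected reward <softmax(z), r>."""
    pi = softmax(z)
    # Gradient of the expected reward w.r.t. the logits under the softmax parameterization.
    grad = pi * (r - pi @ r)
    # Mirror ascent with the log-sum-exp mirror map: since grad Phi(z) = softmax(z),
    # the update in the dual space is softmax(z_new) = softmax(z) + eta * grad.
    # The gradient sums to zero, so pi_new stays normalized; eta must be small
    # enough (relative to the reward range) to keep it strictly positive.
    pi_new = pi + eta * grad
    return np.log(pi_new)        # map back to logits

# Toy usage: three arms with rewards in [0, 1] and a constant step-size.
r = np.array([0.2, 0.5, 0.9])
z = np.zeros(3)
for _ in range(200):
    z = spma_step(z, r, eta=0.5)
print(softmax(z))                # probability mass concentrates on the best arm
```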

Cite

Text

Asad et al. "Fast Convergence of SoftMax Policy Mirror Ascent for Bandits & Tabular MDPs." NeurIPS 2024 Workshops: OPT, 2024.

Markdown

[Asad et al. "Fast Convergence of SoftMax Policy Mirror Ascent for Bandits & Tabular MDPs." NeurIPS 2024 Workshops: OPT, 2024.](https://mlanthology.org/neuripsw/2024/asad2024neuripsw-fast/)

BibTeX

@inproceedings{asad2024neuripsw-fast,
  title     = {{Fast Convergence of SoftMax Policy Mirror Ascent for Bandits \& Tabular MDPs}},
  author    = {Asad, Reza and Harikandeh, Reza Babanezhad and Laradji, Issam H. and Le Roux, Nicolas and Vaswani, Sharan},
  booktitle = {NeurIPS 2024 Workshops: OPT},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/asad2024neuripsw-fast/}
}