Does Stochastic Gradient Really Succeed for Bandits?

Abstract

Recent work by Mei et al. (2023, 2024) has deepened the theoretical understanding of the *Stochastic Gradient Bandit* (SGB) policy, showing that using a constant learning rate guarantees asymptotic convergence to the optimal policy, and that sufficiently *small* learning rates can yield logarithmic regret. However, whether logarithmic regret holds beyond small learning rates remains unclear. In this work, we take a step towards characterizing the regret *regimes* of SGB as a function of its learning rate. For two-armed bandits, we identify a sharp threshold, scaling with the sub-optimality gap $\Delta$, below which SGB achieves *logarithmic* regret on all instances, and above which it can incur *polynomial* regret on some instances. This result highlights the necessity of knowing (or estimating) $\Delta$ to ensure logarithmic regret with a constant learning rate. For general $K$-armed bandits, we further show that the learning rate must scale inversely with $K$ to avoid polynomial regret. We introduce novel techniques to derive regret upper bounds for SGB, laying the groundwork for future advances in the theory of gradient-based bandit algorithms.
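For context, the SGB policy maintains a softmax distribution over per-arm preferences and updates them by stochastic gradient ascent on the expected reward with a fixed step size. Below is a minimal sketch of this update on a Bernoulli instance, assuming the standard softmax parameterization and a baseline-free REINFORCE gradient estimator; the function names, reward model, and parameter values are illustrative and not taken from the paper.

```python
import numpy as np

def softmax(theta):
    # Numerically stable softmax over the preference vector.
    z = theta - theta.max()
    p = np.exp(z)
    return p / p.sum()

def run_sgb(means, eta, horizon, seed=0):
    """Stochastic gradient bandit on a Bernoulli instance with arm means `means`
    and constant learning rate `eta`; returns the cumulative pseudo-regret."""
    rng = np.random.default_rng(seed)
    K = len(means)
    theta = np.zeros(K)                         # softmax preferences, one per arm
    best = max(means)
    regret = 0.0
    for _ in range(horizon):
        pi = softmax(theta)
        arm = rng.choice(K, p=pi)               # sample an arm from the current policy
        reward = float(rng.random() < means[arm])   # Bernoulli reward
        # Baseline-free REINFORCE estimate of the gradient of the expected reward.
        grad = reward * ((np.arange(K) == arm).astype(float) - pi)
        theta += eta * grad                     # constant-step-size ascent
        regret += best - means[arm]
    return regret

# Two-armed instance with sub-optimality gap Delta = 0.2 and a small step size.
print(run_sgb([0.7, 0.5], eta=0.05, horizon=10_000))
```

In this sketch, choosing `eta` small relative to the gap corresponds to the regime in which the paper proves logarithmic regret, whereas larger learning rates fall in the regime where polynomial regret is possible on some instances.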

Cite

Text

Baudry et al. "Does Stochastic Gradient Really Succeed for Bandits?" Advances in Neural Information Processing Systems, 2025.

Markdown

[Baudry et al. "Does Stochastic Gradient Really Succeed for Bandits?" Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/baudry2025neurips-stochastic/)

BibTeX

@inproceedings{baudry2025neurips-stochastic,
  title     = {{Does Stochastic Gradient Really Succeed for Bandits?}},
  author    = {Baudry, Dorian and Johnson, Emmeran and Vary, Simon and Pike-Burke, Ciara and Rebeschini, Patrick},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/baudry2025neurips-stochastic/}
}