High-Probability Convergence Bounds for Online Nonlinear Stochastic Gradient Descent Under Heavy-Tailed Noise

Abstract

We study high-probability convergence in online learning, in the presence of heavy-tailed noise. To combat the heavy tails, a general framework of nonlinear SGD methods is considered, subsuming several popular nonlinearities such as sign, quantization, and component-wise and joint clipping. In our work, the nonlinearity is treated in a black-box manner, allowing us to establish unified guarantees for a broad range of nonlinear methods. For symmetric noise and non-convex costs we establish convergence of the squared gradient norm at a rate $\widetilde{\mathcal{O}}(t^{-1/4})$, while for the last iterate of strongly convex costs we establish convergence to the population optimum at a rate $\mathcal{O}(t^{-\zeta})$, where $\zeta \in (0,1)$ depends on noise and problem parameters. Further, if the noise is a (biased) mixture of symmetric and non-symmetric components, we show convergence to a neighbourhood of stationarity, whose size depends on the mixture coefficient, the nonlinearity, and the noise. Compared to state-of-the-art works, which only consider clipping and require unbiased noise with bounded $p$-th moments, $p \in (1,2]$, we provide guarantees for a broad class of nonlinearities, without any assumptions on noise moments. While the rate exponents in state-of-the-art results depend on noise moments and vanish as $p \rightarrow 1$, our exponents are constant and strictly better whenever $p < 6/5$ for non-convex and $p < 8/7$ for strongly convex costs. Experiments validate our theory, showing that clipping is not always the optimal nonlinearity, further underlining the value of a general framework.
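To make the black-box framework concrete, here is a minimal illustrative sketch of an online nonlinear SGD update in which the nonlinearity is a swappable function applied to the stochastic gradient. This is not the authors' code: the function names, the step-size schedule $\alpha_t = a/(t+b)$, the quadratic toy cost, and the Cauchy gradient noise are all our own assumptions, chosen only to mimic the heavy-tailed, possibly moment-free setting the paper targets.

import numpy as np

def sign_nl(g):
    """Component-wise sign nonlinearity."""
    return np.sign(g)

def clip_nl(g, tau=1.0):
    """Joint (norm) clipping: rescale g when its norm exceeds tau."""
    norm = np.linalg.norm(g)
    return g if norm <= tau else (tau / norm) * g

def comp_clip_nl(g, tau=1.0):
    """Component-wise clipping of each coordinate to [-tau, tau]."""
    return np.clip(g, -tau, tau)

def nonlinear_sgd(grad_oracle, x0, nonlinearity, steps=1000, a=1.0, b=1.0):
    """Online nonlinear SGD: x_{t+1} = x_t - alpha_t * N(g_t).

    grad_oracle(x, t) returns a noisy gradient (noise may be heavy-tailed);
    the nonlinearity N is treated as a black box, so sign, quantization,
    and clipping are all instances of the same update.
    """
    x = np.array(x0, dtype=float)
    for t in range(steps):
        g = grad_oracle(x, t)
        alpha = a / (t + b)  # illustrative decaying step size (assumption)
        x -= alpha * nonlinearity(g)
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)

    # Toy strongly convex cost f(x) = 0.5 * ||x||^2 with symmetric,
    # Cauchy-distributed gradient noise: even the first moment does not
    # exist, which is the regime where moment-based analyses break down.
    def noisy_grad(x, t):
        return x + rng.standard_cauchy(size=x.shape)

    x_final = nonlinear_sgd(noisy_grad, x0=np.ones(10), nonlinearity=comp_clip_nl)
    print("distance to optimum:", np.linalg.norm(x_final))

Swapping nonlinearity=comp_clip_nl for sign_nl or clip_nl leaves the update loop untouched, which is the point of the unified treatment: the guarantees apply to the family of nonlinearities rather than to clipping alone.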

Cite

Text

Armacki et al. "High-Probability Convergence Bounds for Online Nonlinear Stochastic Gradient Descent Under Heavy-Tailed Noise." Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, 2025.

Markdown

[Armacki et al. "High-Probability Convergence Bounds for Online Nonlinear Stochastic Gradient Descent Under Heavy-Tailed Noise." Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, 2025.](https://mlanthology.org/aistats/2025/armacki2025aistats-highprobability/)

BibTeX

@inproceedings{armacki2025aistats-highprobability,
  title     = {{High-Probability Convergence Bounds for Online Nonlinear Stochastic Gradient Descent Under Heavy-Tailed Noise}},
  author    = {Armacki, Aleksandar and Yu, Shuhua and Sharma, Pranay and Joshi, Gauri and Bajovic, Dragana and Jakovetic, Dusan and Kar, Soummya},
  booktitle = {Proceedings of The 28th International Conference on Artificial Intelligence and Statistics},
  year      = {2025},
  pages     = {1774--1782},
  volume    = {258},
  url       = {https://mlanthology.org/aistats/2025/armacki2025aistats-highprobability/}
}