Large Stepsize Gradient Descent for Logistic Loss: Non-Monotonicity of the Loss Improves Optimization Efficiency

Abstract

We consider \emph{gradient descent} (GD) with a constant stepsize applied to logistic regression with linearly separable data, where the stepsize $\eta$ is so large that the loss initially oscillates. We show that GD exits this initial oscillatory phase rapidly, in $O(\eta)$ steps, and subsequently achieves an $\tilde{O}(1/(\eta t))$ convergence rate after $t$ additional steps. Our results imply that, given a budget of $T$ steps, GD can achieve an \emph{accelerated} loss of $\tilde{O}(1/T^2)$ with an aggressive stepsize $\eta := \Theta(T)$, without any use of momentum or variable stepsize schedulers. Our proof technique is versatile and also handles general classification loss functions (where exponential tails are needed for the $\tilde{O}(1/T^2)$ acceleration), nonlinear predictors in the \emph{neural tangent kernel} regime, and online \emph{stochastic gradient descent} (SGD) with a large stepsize, under suitable separability conditions.
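
For intuition, the $\tilde{O}(1/T^2)$ rate follows by combining the abstract's two statements: with $\eta = \Theta(T)$, the $O(\eta)$ burn-in consumes only a constant fraction of the $T$-step budget, and the remaining $t = \Theta(T)$ steps give a loss of $\tilde{O}(1/(\eta t)) = \tilde{O}(1/T^2)$. Below is a minimal Python sketch (not the authors' code) of the setting the abstract describes: full-batch GD with a large constant stepsize on the logistic loss over linearly separable data. The synthetic data, the sizes n and d, the budget T, and the specific choice eta = T are illustrative assumptions; the sketch is only meant to let one observe an early non-monotone (oscillatory) phase of the loss followed by a later decrease.

import numpy as np

rng = np.random.default_rng(0)

# Linearly separable synthetic data: labels are the sign of <w_star, x_i>.
n, d = 100, 5                       # illustrative sizes (not from the paper)
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = np.sign(X @ w_star)

def logistic_loss(w):
    # Average logistic loss (1/n) * sum_i log(1 + exp(-y_i <w, x_i>)),
    # computed with logaddexp for numerical stability at large margins.
    margins = y * (X @ w)
    return np.mean(np.logaddexp(0.0, -margins))

def gradient(w):
    # d/dw log(1 + exp(-m_i)) = -sigmoid(-m_i) * y_i * x_i, averaged over i;
    # sigmoid(-m) is computed as exp(-log(1 + exp(m))) for stability.
    margins = y * (X @ w)
    coef = -np.exp(-np.logaddexp(0.0, margins))
    return (X * (coef * y)[:, None]).mean(axis=0)

# Aggressive constant stepsize on the order of the step budget T, mirroring
# the eta = Theta(T) choice from the abstract (the constant here is a guess).
T = 1000
eta = float(T)

w = np.zeros(d)
for t in range(T):
    w -= eta * gradient(w)
    if t < 10 or t % 100 == 0:      # dense prints early to see the oscillation
        print(f"step {t:4d}   loss {logistic_loss(w):.3e}")

Running the sketch typically shows the average loss jumping up and down during the first steps before settling into a steady decrease, which is the non-monotone behavior the title refers to; the exact trajectory depends on the random data and the stepsize constant.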

Cite

Text

Wu et al. "Large Stepsize Gradient Descent for Logistic Loss: Non-Monotonicity of the Loss Improves Optimization Efficiency." Conference on Learning Theory, 2024.

Markdown

[Wu et al. "Large Stepsize Gradient Descent for Logistic Loss: Non-Monotonicity of the Loss Improves Optimization Efficiency." Conference on Learning Theory, 2024.](https://mlanthology.org/colt/2024/wu2024colt-large/)

BibTeX

@inproceedings{wu2024colt-large,
  title     = {{Large Stepsize Gradient Descent for Logistic Loss: Non-Monotonicity of the Loss Improves Optimization Efficiency}},
  author    = {Wu, Jingfeng and Bartlett, Peter L. and Telgarsky, Matus and Yu, Bin},
  booktitle = {Conference on Learning Theory},
  year      = {2024},
  pages     = {5019--5073},
  volume    = {247},
  url       = {https://mlanthology.org/colt/2024/wu2024colt-large/}
}