Directional Smoothness and Gradient Methods: Convergence and Adaptivity

Abstract

We develop new sub-optimality bounds for gradient descent (GD) that depend on the conditioning of the objective along the path of optimization, rather than on global, worst-case constants. Key to our proofs is directional smoothness, a measure of gradient variation that we use to develop upper bounds on the objective. Minimizing these upper bounds requires solving implicit equations to obtain a sequence of strongly adapted step-sizes; we show that these equations are straightforward to solve for convex quadratics and lead to new guarantees for two classical step-sizes. For general functions, we prove that the Polyak step-size and normalized GD obtain fast, path-dependent rates despite using no knowledge of the directional smoothness. Experiments on logistic regression show our convergence guarantees are tighter than the classical theory based on $L$-smoothness.
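
To make the quantities mentioned in the abstract concrete, the following is a minimal sketch (not taken from the paper's code) of gradient descent with the Polyak step-size on a convex quadratic, tracking the point-wise directional smoothness $\|\nabla f(x_{k+1}) - \nabla f(x_k)\| / \|x_{k+1} - x_k\|$ along the optimization path and comparing it to the global smoothness constant $L$. The specific quadratic objective, random seed, and iteration count are assumptions made purely for illustration.

# Illustrative sketch: Polyak-step GD on a convex quadratic, with the
# point-wise directional smoothness D_k measured along each step taken.
import numpy as np

rng = np.random.default_rng(0)

# Convex quadratic f(x) = 0.5 * x^T A x with A positive definite.
d = 20
M = rng.standard_normal((d, d))
A = M @ M.T + np.eye(d)
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
f_star = 0.0                              # minimum value, attained at x = 0

L_global = np.linalg.eigvalsh(A).max()    # global (worst-case) smoothness constant
x = rng.standard_normal(d)

for k in range(50):
    g = grad(x)
    # Polyak step-size: (f(x_k) - f*) / ||grad f(x_k)||^2 (uses knowledge of f*).
    eta = (f(x) - f_star) / (g @ g)
    x_next = x - eta * g

    # Point-wise directional smoothness along the step from x_k to x_{k+1}.
    D_k = np.linalg.norm(grad(x_next) - g) / np.linalg.norm(x_next - x)
    print(f"iter {k:2d}  f = {f(x_next):.3e}  D_k = {D_k:.3f}  L = {L_global:.3f}")
    x = x_next

In typical runs of this sketch, the measured D_k along the path is much smaller than L_global, which is the kind of gap the path-dependent guarantees in the paper are designed to exploit.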

Cite

Text

Mishkin et al. "Directional Smoothness and Gradient Methods: Convergence and Adaptivity." Neural Information Processing Systems, 2024. doi:10.52202/079017-0473

Markdown

[Mishkin et al. "Directional Smoothness and Gradient Methods: Convergence and Adaptivity." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/mishkin2024neurips-directional/) doi:10.52202/079017-0473

BibTeX

@inproceedings{mishkin2024neurips-directional,
  title     = {{Directional Smoothness and Gradient Methods: Convergence and Adaptivity}},
  author    = {Mishkin, Aaron and Khaled, Ahmed and Wang, Yuanhao and Defazio, Aaron and Gower, Robert M.},
  booktitle = {Neural Information Processing Systems},
  year      = {2024},
  doi       = {10.52202/079017-0473},
  url       = {https://mlanthology.org/neurips/2024/mishkin2024neurips-directional/}
}