Connections Between Schedule-Free SGD, Accelerated SGD Variants, and Weight Averaging

Morwani, Depen; Vyas, Nikhil; Zhang, Hanlin; Kakade, Sham M.

NeurIPSW 2024

Abstract

In this work, we uncover precise connections between recently proposed optimizers, such as schedule-free SGD and Lion, and the literature on accelerated SGD variants. We show that schedule-free SGD can be understood precisely as accelerated SGD combined with weight averaging. The central idea behind all of these optimizers is to decouple the momentum coefficient from the weight placed on the current step's gradient. We support these claims with experiments on a 150M-parameter decoder-only language model, which show that ScheduleFreeAdamW performs close to Adam combined with accelerated SGD and weight averaging.
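The claim that schedule-free SGD behaves like accelerated SGD plus weight averaging can be checked numerically on a toy problem. The sketch below is illustrative only and is not taken from the paper. It assumes the commonly stated schedule-free SGD recursion (gradient point `y`, base iterate `z`, averaged iterate `x`, averaging weight `1 / (t + 1)`). It then rewrites that recursion, by direct algebra, as a momentum update on `y` in which the gradient weight `lr * (1 - beta + beta * c)` is separate from the momentum coefficient `1 - c`, followed by a weighted average of the `y` iterates. The function names, the 1-D quadratic objective, and the hyperparameters are all assumptions chosen for the example.

```python
def grad(w, curvature=2.0, target=3.0):
    """Gradient of the toy objective 0.5 * curvature * (w - target) ** 2."""
    return curvature * (w - target)


def schedule_free_sgd(w0, lr, beta, steps):
    """Schedule-free SGD in (x, z) form, with averaging weight c = 1 / (t + 1)."""
    x = z = w0
    ys, xs = [], []
    for t in range(1, steps + 1):
        y = (1 - beta) * z + beta * x   # point at which the gradient is evaluated
        ys.append(y)
        g = grad(y)
        z = z - lr * g                  # plain SGD step on the base iterate
        c = 1.0 / (t + 1)
        x = (1 - c) * x + c * z         # running average of the z iterates
        xs.append(x)
    return ys, xs


def momentum_plus_averaging(w0, lr, beta, steps):
    """Same iterates, rewritten as a momentum update on y plus averaging of y.

    Here m tracks z - x. The weight on the current gradient in the y update
    is lr * (1 - beta + beta * c); the momentum buffer decays by (1 - c).
    """
    y = x = w0
    m = 0.0
    ys, xs = [], []
    for t in range(1, steps + 1):
        ys.append(y)
        g = grad(y)
        c = 1.0 / (t + 1)
        # Update y with the old buffer m before refreshing the buffer.
        y = y - lr * (1 - beta + beta * c) * g + beta * c * m
        m = (1 - c) * (m - lr * g)
        # Weighted average of the previous x and the new y (weights sum to 1).
        x = ((1 - c) * (1 - beta) * x + c * y) / (1 - beta + beta * c)
        xs.append(x)
    return ys, xs


if __name__ == "__main__":
    ys_a, xs_a = schedule_free_sgd(w0=0.0, lr=0.1, beta=0.9, steps=50)
    ys_b, xs_b = momentum_plus_averaging(w0=0.0, lr=0.1, beta=0.9, steps=50)
    gap_y = max(abs(a - b) for a, b in zip(ys_a, ys_b))
    gap_x = max(abs(a - b) for a, b in zip(xs_a, xs_b))
    print("max gap in y iterates:", gap_y)
    print("max gap in x iterates:", gap_x)
```

Under these assumptions the two functions should produce the same `y` and `x` sequences up to floating-point error. The paper's own formulation and notation may differ from this rewriting.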

Cite

Text

Morwani et al. "Connections Between Schedule-Free SGD, Accelerated SGD Variants, and Weight Averaging." NeurIPS 2024 Workshops: OPT, 2024.

Markdown

[Morwani et al. "Connections Between Schedule-Free SGD, Accelerated SGD Variants, and Weight Averaging." NeurIPS 2024 Workshops: OPT, 2024.](https://mlanthology.org/neuripsw/2024/morwani2024neuripsw-connections/)

BibTeX

@inproceedings{morwani2024neuripsw-connections,
  title     = {{Connections Between Schedule-Free SGD, Accelerated SGD Variants, and Weight Averaging}},
  author    = {Morwani, Depen and Vyas, Nikhil and Zhang, Hanlin and Kakade, Sham M.},
  booktitle = {NeurIPS 2024 Workshops: OPT},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/morwani2024neuripsw-connections/}
}