Sign-SGD via Parameter-Free Optimization

Abstract

Large language models have achieved major advances across domains, yet training them remains extremely resource-intensive. We revisit Sign-SGD, which serves both as a memory-efficient optimizer for single-node training and as a gradient compression mechanism for distributed learning. This paper addresses a central limitation: the effective stepsize cannot be determined a priori because it relies on unknown, problem-specific quantities. We present a parameter-free Sign-SGD that removes manual stepsize selection. We analyze the deterministic single-node case and extend the method to stochastic single-node training and multi-node settings. We also incorporate the momentum technique into our algorithms and propose a memory-efficient variant that stores only gradient signs instead of full gradients. We evaluate our methods on pre-training LLaMA models (130M and 350M) and fine-tuning a Swin Transformer (28M). Across the considered tasks, the proposed methods match the performance of tuned Sign-SGD and AdamW (grid-searched stepsizes with a cosine schedule), while avoiding tuning overhead. Employing parameter-free training yields an approximately $1.5\times$ end-to-end speedup compared to runs with grid-searched stepsizes.
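For orientation, the baseline that the abstract builds on can be sketched as follows. This is a minimal illustration of plain Sign-SGD with a hand-picked stepsize and of sign aggregation by majority vote, not the parameter-free method proposed in the paper. The function names `sign_sgd_step` and `majority_vote`, the toy quadratic objective, and the stepsize value are all assumptions made for the example.

```python
import numpy as np

def sign_sgd_step(x, grad, stepsize):
    # Plain Sign-SGD: every coordinate moves by the same amount,
    # in the direction opposite to the sign of its gradient entry.
    return x - stepsize * np.sign(grad)

def majority_vote(worker_grads):
    # Distributed use: each worker sends only the signs of its gradient
    # (one bit per coordinate); the aggregate is the coordinate-wise
    # majority sign (0 on ties).
    return np.sign(np.sum(np.sign(worker_grads), axis=0))

# Toy objective f(x) = 0.5 * ||x||^2, whose gradient is x itself.
x = np.array([1.0, -2.0, 0.5])
for _ in range(100):
    # The stepsize 0.01 is chosen by hand here (an assumption for this toy run).
    x = sign_sgd_step(x, grad=x, stepsize=0.01)
```

Because the update magnitude is set entirely by `stepsize`, a poor choice either stalls progress or causes oscillation around the minimizer, which is the tuning burden the abstract refers to.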

Cite

Text

Medyakov et al. "Sign-SGD via Parameter-Free Optimization." International Conference on Learning Representations, 2026.

Markdown

[Medyakov et al. "Sign-SGD via Parameter-Free Optimization." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/medyakov2026iclr-signsgd/)

BibTeX

@inproceedings{medyakov2026iclr-signsgd,
  title     = {{Sign-SGD via Parameter-Free Optimization}},
  author    = {Medyakov, Daniil and Stanko, Sergey and Molodtsov, Gleb and Zmushko, Philip and Evseev, Grigoriy and Petrov, Egor and Beznosikov, Aleksandr},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/medyakov2026iclr-signsgd/}
}