Memory Efficient Adaptive Stochastic Optimization via Subset-Norm

Abstract

As deep neural networks grow larger, memory efficiency becomes crucial, with the optimizer states of popular algorithms like Adam consuming substantial memory. This paper generalizes existing high-probability convergence analysis for AdaGrad and AdaGrad-Norm to arbitrary parameter partitions, recovering both algorithms as special cases. We reveal a trade-off between coordinate-noise density and the convergence rate's dimensional dependency, suggesting that an optimal grouping lies between the full coordinate version (AdaGrad) and the scalar version (AdaGrad-Norm). This insight leads to a principled compression approach called *Subset-Norm*, targeting the coordinate-wise second-moment term in AdaGrad, RMSProp, and Adam. We demonstrate the empirical effectiveness of subset-norm step sizes on LLM pre-training tasks with LLaMA models, showing performance competitive with baselines like Adam while significantly reducing the optimizer state's memory usage from $O(d)$ to $O(\sqrt{d})$ and introducing no additional hyperparameters.
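The sketch below is a minimal illustration (not the authors' implementation) of the subset-norm idea described in the abstract: coordinates of a parameter tensor are partitioned into roughly $\sqrt{d}$ subsets, and the AdaGrad-style second-moment accumulator stores one scalar per subset instead of one per coordinate. All names and the contiguous partitioning scheme are illustrative assumptions.

```python
import numpy as np

def subset_norm_adagrad_step(param, grad, state, lr=1e-2, eps=1e-8):
    """One AdaGrad-style update with a subset-norm second moment.

    Illustrative sketch: coordinates are split into k = ceil(sqrt(d))
    contiguous subsets, and a single accumulator entry is kept per subset,
    so the optimizer state is O(sqrt(d)) instead of O(d).
    """
    d = param.size
    k = int(np.ceil(np.sqrt(d)))           # number of subsets ~ sqrt(d)
    if "acc" not in state:
        state["acc"] = np.zeros(k)         # one scalar per subset

    g = grad.reshape(-1)
    pad = (-d) % k                         # pad so g splits evenly into k rows
    g_pad = np.pad(g, (0, pad)).reshape(k, -1)

    # Accumulate the squared norm of each subset's gradient.
    state["acc"] += np.sum(g_pad ** 2, axis=1)

    # Adaptive step size shared by all coordinates within a subset.
    scale = 1.0 / (np.sqrt(state["acc"]) + eps)       # shape (k,)
    update = (g_pad * scale[:, None]).reshape(-1)[:d]
    return param - lr * update.reshape(param.shape), state
```

For a tensor with $d = 10^6$ entries, the accumulator holds only about $10^3$ scalars; the same per-subset compression can, in principle, replace the coordinate-wise second moment in RMSProp or Adam as the abstract suggests.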

Cite

Text

Nguyen and Nguyen. "Memory Efficient Adaptive Stochastic Optimization via Subset-Norm." NeurIPS 2024 Workshops: OPT, 2024.

Markdown

[Nguyen and Nguyen. "Memory Efficient Adaptive Stochastic Optimization via Subset-Norm." NeurIPS 2024 Workshops: OPT, 2024.](https://mlanthology.org/neuripsw/2024/nguyen2024neuripsw-memory/)

BibTeX

@inproceedings{nguyen2024neuripsw-memory,
  title     = {{Memory Efficient Adaptive Stochastic Optimization via Subset-Norm}},
  author    = {Nguyen, Thien Hang and Nguyen, Huy},
  booktitle = {NeurIPS 2024 Workshops: OPT},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/nguyen2024neuripsw-memory/}
}