Memory Efficient Adaptive Stochastic Optimization via Subset-Norm
Abstract
As deep neural networks grow larger, memory efficiency becomes crucial, with the optimizer states of popular algorithms like Adam consuming substantial memory. This paper generalizes existing high-probability convergence analyses for AdaGrad and AdaGrad-Norm to arbitrary parameter partitions, encompassing both algorithms as special cases. We reveal a trade-off between coordinate-noise density and the dimensional dependency of the convergence rate, suggesting an optimal grouping that lies between the full coordinate-wise version (AdaGrad) and the scalar version (AdaGrad-Norm). This insight leads to a principled compression approach called \textit{Subset-Norm}, which targets the coordinate-wise second-moment term in AdaGrad, RMSProp, and Adam. We demonstrate the empirical effectiveness of subset-norm step sizes on LLM pre-training tasks with LLaMA models, showing performance competitive with baselines like Adam while reducing the optimizer state's memory from $O(d)$ to $O(\sqrt{d})$ and introducing no additional hyperparameters.
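Based only on the abstract's description (not the authors' released code), a minimal sketch of a subset-norm adaptive step is given below: the $d$ coordinates of a parameter are partitioned into roughly $\sqrt{d}$ subsets, a single squared-gradient norm is accumulated per subset, and every coordinate in a subset shares that accumulator in its AdaGrad-style denominator. The function name, the contiguous partitioning, and the padding scheme are illustrative assumptions.

```python
import numpy as np

def subset_norm_adagrad_step(param, grad, accum, lr=1e-2, eps=1e-8, subset_size=None):
    """One AdaGrad-style update with a subset-norm second-moment accumulator.

    Illustrative sketch: coordinates are split into contiguous subsets of
    size ~sqrt(d); each subset shares one accumulated squared-gradient norm,
    so the adaptive state is O(sqrt(d)) instead of O(d).
    """
    d = grad.size
    if subset_size is None:
        subset_size = int(np.ceil(np.sqrt(d)))
    k = int(np.ceil(d / subset_size))

    g = grad.reshape(-1)
    padded = np.zeros(k * subset_size)   # pad so the flat gradient splits evenly
    padded[:d] = g

    # One scalar per subset: the accumulated squared norm of that subset's gradient.
    sq_norms = (padded.reshape(k, subset_size) ** 2).sum(axis=1)
    accum = accum + sq_norms             # accum has shape (k,), i.e. O(sqrt(d)) memory

    # Every coordinate in a subset is scaled by its subset's shared accumulator.
    denom = np.sqrt(accum) + eps
    scaled = (padded.reshape(k, subset_size) / denom[:, None]).reshape(-1)[:d]
    return param - lr * scaled.reshape(param.shape), accum
```

In this sketch the accumulator `accum` holds one scalar per subset, so it would be initialized as zeros of shape `(ceil(d / subset_size),)` and carried across iterations, in contrast to the length-$d$ second-moment vector of coordinate-wise AdaGrad.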
Cite
Text
Nguyen and Nguyen. "Memory Efficient Adaptive Stochastic Optimization via Subset-Norm." NeurIPS 2024 Workshops: OPT, 2024.
Markdown
[Nguyen and Nguyen. "Memory Efficient Adaptive Stochastic Optimization via Subset-Norm." NeurIPS 2024 Workshops: OPT, 2024.](https://mlanthology.org/neuripsw/2024/nguyen2024neuripsw-memory/)
BibTeX
@inproceedings{nguyen2024neuripsw-memory,
title = {{Memory Efficient Adaptive Stochastic Optimization via Subset-Norm}},
author = {Nguyen, Thien Hang and Nguyen, Huy},
booktitle = {NeurIPS 2024 Workshops: OPT},
year = {2024},
url = {https://mlanthology.org/neuripsw/2024/nguyen2024neuripsw-memory/}
}