Adam-Mini: Use Fewer Learning Rates to Gain More

Abstract

We propose Adam-mini, an optimizer that achieves performance on par with or better than AdamW with a $45\%$ to $50\%$ smaller memory footprint. Adam-mini reduces memory by cutting down the learning rate resources in Adam (i.e., $1/\sqrt{v}$). We find that $\geq 90\%$ of these learning rates in $v$ can be harmlessly removed if we (1) carefully partition the parameters into blocks following our proposed principle based on the Hessian structure; and (2) assign a single but good learning rate to each parameter block. We further find that, for each of these parameter blocks, there exists a single high-quality learning rate that can outperform Adam, provided that sufficient resources are available to search it out. We then provide a cost-effective way to find good learning rates and propose Adam-mini. Empirically, we verify that Adam-mini performs on par with or better than AdamW on various language models sized from 125M to 7B for pre-training, supervised fine-tuning, and RLHF. The reduced memory footprint of Adam-mini also alleviates communication overheads among GPUs, thereby increasing throughput. For instance, Adam-mini achieves $49.6\%$ higher throughput than AdamW when pre-training Llama2-7B on $2\times$ A800-80GB GPUs, which saves $33\%$ of the wall-clock time for pre-training.
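
To make the block-wise second-moment idea above concrete, below is a minimal PyTorch-style sketch, not the authors' released implementation: it keeps Adam's per-coordinate first moment but stores only one scalar second moment per block. For simplicity it treats each parameter tensor as a single block, whereas the paper partitions parameters according to Hessian structure; the class name, defaults, and block choice here are illustrative assumptions.

import torch

class AdamMiniSketch(torch.optim.Optimizer):
    """Illustrative sketch: Adam-style updates with ONE scalar second moment per block."""

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.0):
        defaults = dict(lr=lr, betas=betas, eps=eps, weight_decay=weight_decay)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            beta1, beta2 = group["betas"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                g = p.grad
                state = self.state[p]
                if len(state) == 0:
                    state["step"] = 0
                    state["m"] = torch.zeros_like(p)               # per-coordinate first moment, as in Adam
                    state["v"] = torch.zeros((), device=p.device)  # one scalar second moment for the whole block
                state["step"] += 1
                t = state["step"]
                m, v = state["m"], state["v"]

                # Decoupled (AdamW-style) weight decay.
                if group["weight_decay"] != 0.0:
                    p.mul_(1.0 - group["lr"] * group["weight_decay"])

                # Standard first-moment update.
                m.mul_(beta1).add_(g, alpha=1.0 - beta1)
                # Block-wise second moment: EMA of the mean squared gradient over the block,
                # replacing Adam's per-coordinate v.
                v.mul_(beta2).add_(g.pow(2).mean(), alpha=1.0 - beta2)

                # Bias correction, then update with a single learning rate for the block.
                m_hat = m / (1.0 - beta1 ** t)
                v_hat = v / (1.0 - beta2 ** t)
                p.add_(m_hat, alpha=-group["lr"] / (v_hat.sqrt().item() + group["eps"]))

In this sketch, the per-block scalar v plays the role of the single learning rate assigned to each block; the memory saving comes from not storing a per-coordinate v.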

Cite

Text

Zhang et al. "Adam-Mini: Use Fewer Learning Rates to Gain More." ICML 2024 Workshops: ES-FoMo-II, 2024.

Markdown

[Zhang et al. "Adam-Mini: Use Fewer Learning Rates to Gain More." ICML 2024 Workshops: ES-FoMo-II, 2024.](https://mlanthology.org/icmlw/2024/zhang2024icmlw-adammini/)

BibTeX

@inproceedings{zhang2024icmlw-adammini,
  title     = {{Adam-Mini: Use Fewer Learning Rates to Gain More}},
  author    = {Zhang, Yushun and Chen, Congliang and Li, Ziniu and Ding, Tian and Wu, Chenwei and Ye, Yinyu and Luo, Zhi-Quan and Sun, Ruoyu},
  booktitle = {ICML 2024 Workshops: ES-FoMo-II},
  year      = {2024},
  url       = {https://mlanthology.org/icmlw/2024/zhang2024icmlw-adammini/}
}