Addax: Memory-Efficient Fine-Tuning of Language Models with a Combination of Forward-Backward and Forward-Only Passes

Abstract

Fine-tuning language models (LMs) with first-order optimizers often demands excessive memory, limiting accessibility, while zeroth-order optimizers use less memory but suffer from slow convergence that depends on model size. We introduce a novel method named Addax that integrates the recently introduced Memory-Efficient Zeroth-order Optimizer (MeZO) of Malladi et al. (2023) with Stochastic Gradient Descent (SGD). Addax obtains zeroth-order and first-order gradient estimates and optimally combines them as the descent direction in each step. The first-order updates are performed "in-place" to further save memory. Theoretically, we establish the convergence of Addax under mild assumptions, showing that it admits less restrictive hyper-parameter choices and a convergence guarantee independent of model size. Our extensive experiments with diverse LMs and tasks show that Addax consistently outperforms zero-shot evaluation and MeZO in accuracy. Moreover, Addax surpasses standard fine-tuning approaches, such as SGD and Adam, in specific scenarios while requiring significantly less memory.
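
For intuition, the sketch below illustrates the general recipe the abstract describes: a MeZO-style SPSA gradient estimate obtained from two forward-only passes is mixed with an ordinary backprop (SGD) gradient, and the combined direction is applied in place. The mixing weight `alpha`, the tiny linear model, and the separate minibatches are illustrative assumptions, not the paper's exact algorithm.

```python
# Minimal sketch: combine a zeroth-order (SPSA) estimate with a first-order
# SGD gradient and update the parameters in place. Illustrative only.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(16, 2)                      # stand-in for a language model
loss_fn = nn.CrossEntropyLoss()

def zo_grad_estimate(model, x, y, eps=1e-3):
    """SPSA estimate from two forward-only passes with a shared random seed."""
    seed = torch.randint(0, 2**31 - 1, (1,)).item()

    def perturb(scale):
        gen = torch.Generator().manual_seed(seed)
        for p in model.parameters():
            z = torch.randn(p.shape, generator=gen)
            p.data.add_(scale * eps * z)

    with torch.no_grad():
        perturb(+1.0); loss_plus = loss_fn(model(x), y).item()    # theta + eps*z
        perturb(-2.0); loss_minus = loss_fn(model(x), y).item()   # theta - eps*z
        perturb(+1.0)                                             # restore theta
    coeff = (loss_plus - loss_minus) / (2 * eps)

    # Re-materialize the same perturbation directions as the gradient estimate.
    gen = torch.Generator().manual_seed(seed)
    return [coeff * torch.randn(p.shape, generator=gen) for p in model.parameters()]

def fo_grad(model, x, y):
    """Ordinary backprop gradient on a (possibly different) minibatch."""
    model.zero_grad()
    loss_fn(model(x), y).backward()
    return [p.grad.detach().clone() for p in model.parameters()]

# One illustrative step: combine the two estimates and update in place.
x_zo, y_zo = torch.randn(8, 16), torch.randint(0, 2, (8,))
x_fo, y_fo = torch.randn(8, 16), torch.randint(0, 2, (8,))
alpha, lr = 0.5, 1e-2                          # assumed mixing weight and step size
g_zo = zo_grad_estimate(model, x_zo, y_zo)
g_fo = fo_grad(model, x_fo, y_fo)
with torch.no_grad():
    for p, gz, gf in zip(model.parameters(), g_zo, g_fo):
        p.add_(-lr * (alpha * gz + (1 - alpha) * gf))   # in-place SGD-style update
```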

Cite

Text

Li et al. "Addax: Memory-Efficient Fine-Tuning of Language Models with a Combination of Forward-Backward and Forward-Only Passes." ICLR 2024 Workshops: PML4LRS, 2024.

Markdown

[Li et al. "Addax: Memory-Efficient Fine-Tuning of Language Models with a Combination of Forward-Backward and Forward-Only Passes." ICLR 2024 Workshops: PML4LRS, 2024.](https://mlanthology.org/iclrw/2024/li2024iclrw-addax/)

BibTeX

@inproceedings{li2024iclrw-addax,
  title     = {{Addax: Memory-Efficient Fine-Tuning of Language Models with a Combination of Forward-Backward and Forward-Only Passes}},
  author    = {Li, Zeman and Zhang, Xinwei and Razaviyayn, Meisam},
  booktitle = {ICLR 2024 Workshops: PML4LRS},
  year      = {2024},
  url       = {https://mlanthology.org/iclrw/2024/li2024iclrw-addax/}
}