GRASS: Compute Efficient Low-Memory LLM Training with Structured Sparse Gradients

Abstract

Large language model (LLM) training and finetuning are often severely constrained by limited GPU memory. While parameter-efficient finetuning techniques like LoRA address this by learning low-rank weight updates, they frequently underperform compared to full-rank training, especially during pretraining. We propose GRASS (GRAdient Structured Sparsification), a novel approach that slashes LLM training memory and compute requirements without compromising performance. GRASS leverages sparse projections to transform gradients into structurally sparse gradients, significantly lowering memory usage for both optimizer states and gradient communication. This compression, in turn, unlocks substantial throughput improvements. Extensive experiments on pretraining and finetuning tasks demonstrate that GRASS achieves comparable performance to existing projection-based optimizers and full-rank training. Notably, GRASS enables pretraining a 13B parameter LLaMA model on a single 40GB A100 GPU---a feat infeasible for previous methods---and yields up to a $2\times$ throughput improvement on an 8-GPU system.
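To make the abstract's core idea concrete, below is a minimal, hypothetical sketch of how a structured sparse gradient projection can shrink optimizer memory: each 2D weight gradient is compressed by keeping only a subset of its rows, and Adam-style moments are stored only in that compressed space. The row-selection rule, class name `RowSparseAdam`, and hyperparameters are illustrative assumptions, not the actual GRASS algorithm from the paper.

```python
# Hypothetical sketch: row-sparse gradient compression with Adam state kept
# only for the selected rows. Not the GRASS algorithm itself.
import torch


class RowSparseAdam:
    def __init__(self, params, k=64, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
        self.params = list(params)
        self.k, self.lr, self.betas, self.eps = k, lr, betas, eps
        self.state = {}  # per-parameter selected rows and compressed moments
        self.t = 0

    @torch.no_grad()
    def step(self):
        self.t += 1
        b1, b2 = self.betas
        for i, p in enumerate(self.params):
            if p.grad is None or p.dim() != 2:
                continue  # sketch handles only 2D weight matrices
            g = p.grad
            if i not in self.state:
                # One possible selection rule: keep the k rows with the
                # largest gradient norm (assumption, for illustration only).
                rows = torch.topk(g.norm(dim=1), min(self.k, g.shape[0])).indices
                self.state[i] = {
                    "rows": rows,
                    "m": torch.zeros(len(rows), g.shape[1], device=g.device),
                    "v": torch.zeros(len(rows), g.shape[1], device=g.device),
                }
            st = self.state[i]
            g_c = g[st["rows"]]  # compressed (k x n) gradient
            st["m"].mul_(b1).add_(g_c, alpha=1 - b1)
            st["v"].mul_(b2).addcmul_(g_c, g_c, value=1 - b2)
            m_hat = st["m"] / (1 - b1 ** self.t)
            v_hat = st["v"] / (1 - b2 ** self.t)
            # Update only the selected rows of the weight matrix.
            p[st["rows"]] -= self.lr * m_hat / (v_hat.sqrt() + self.eps)
```

Under this simplification, optimizer state for an m-by-n weight drops from O(mn) to O(kn) per moment, which is the kind of saving that also lets compressed gradients be communicated cheaply across GPUs; the paper's actual projection and update rules should be consulted for the real method.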

Cite

Text

Muhamed et al. "GRASS: Compute Efficient Low-Memory LLM Training with Structured Sparse Gradients." ICML 2024 Workshops: ES-FoMo-II, 2024.

Markdown

[Muhamed et al. "GRASS: Compute Efficient Low-Memory LLM Training with Structured Sparse Gradients." ICML 2024 Workshops: ES-FoMo-II, 2024.](https://mlanthology.org/icmlw/2024/muhamed2024icmlw-grass/)

BibTeX

@inproceedings{muhamed2024icmlw-grass,
  title     = {{GRASS: Compute Efficient Low-Memory LLM Training with Structured Sparse Gradients}},
  author    = {Muhamed, Aashiq and Li, Oscar and Woodruff, David and Diab, Mona T. and Smith, Virginia},
  booktitle = {ICML 2024 Workshops: ES-FoMo-II},
  year      = {2024},
  url       = {https://mlanthology.org/icmlw/2024/muhamed2024icmlw-grass/}
}