Low Rank Quantization-Aware Training for LLMs

Abstract

In this paper we propose LR-QAT, a lightweight and memory-efficient QAT algorithm for LLMs. LR-QAT employs several components to save memory without sacrificing performance: (a) a low-rank quantization-aware reparameterization; (b) a downcasting operation using fixed-point or double-packing; and (c) checkpointing. Unlike most related work, our method (i) is inference-efficient, incurring no additional overhead compared to traditional PTQ; (ii) can be seen as a general extended pre-training framework, meaning the resulting model can still be used for any downstream task afterwards; and (iii) is orthogonal to most recent PTQ methods and can therefore be seamlessly combined with them. We apply LR-QAT to the LLaMA-1/2/3 and Mistral model families and validate its effectiveness on several downstream tasks. Our method outperforms most recent LLM quantization approaches and reaches the same model performance as full-model QAT at a fraction of its memory usage. Specifically, we can train a 7B LLM on a single consumer-grade GPU with 24GB of memory.
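To make the low-rank quantization-aware reparameterization concrete, here is a minimal PyTorch sketch of the general idea: a frozen, downcast pretrained weight is combined with a trainable low-rank product inside the quantizer, and training uses a straight-through estimator through the rounding step. The class name, rank, scaling, and symmetric INT4 grid below are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LowRankQATLinear(nn.Module):
    """Sketch of a linear layer with a low-rank quantization-aware reparameterization.

    The frozen pretrained weight W0 is kept in a pre-scaled ("downcast") form, a
    trainable low-rank product A @ B is added inside the quantizer, and rounding
    uses the straight-through estimator (STE). Hyperparameters are illustrative.
    """

    def __init__(self, w0: torch.Tensor, rank: int = 32, n_bits: int = 4, alpha: float = 1.0):
        super().__init__()
        out_f, in_f = w0.shape
        self.n_levels = 2 ** (n_bits - 1)                  # symmetric signed grid, e.g. INT4: [-8, 7]
        scale = w0.abs().max() / (self.n_levels - 1)
        self.register_buffer("scale", scale)
        # Frozen weight stored pre-scaled; kept in float here for clarity, whereas the
        # paper's downcasting operation would use fixed-point or double-packed storage.
        self.register_buffer("w0_scaled", w0 / scale)
        # Trainable low-rank auxiliary weights (the only full-precision parameters).
        self.A = nn.Parameter(torch.zeros(out_f, rank))
        self.B = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.alpha = alpha / rank

    def quantized_weight(self) -> torch.Tensor:
        # Low-rank term is added *inside* the quantizer, on the integer grid.
        w = self.w0_scaled + self.alpha * (self.A @ self.B)
        w_int = torch.clamp(torch.round(w), -self.n_levels, self.n_levels - 1)
        # STE: forward uses the rounded value, backward passes gradients through unchanged.
        w_int = w + (w_int - w).detach()
        return self.scale * w_int

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.linear(x, self.quantized_weight())
```

Because the low-rank term lives inside the rounding operation, it can be absorbed into the final integer weights once training ends, which is consistent with the abstract's claim of no inference overhead compared to traditional PTQ.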

Cite

Text

Bondarenko et al. "Low Rank Quantization-Aware Training for LLMs." ICML 2024 Workshops: ES-FoMo-II, 2024.

Markdown

[Bondarenko et al. "Low Rank Quantization-Aware Training for LLMs." ICML 2024 Workshops: ES-FoMo-II, 2024.](https://mlanthology.org/icmlw/2024/bondarenko2024icmlw-low/)

BibTeX

@inproceedings{bondarenko2024icmlw-low,
  title     = {{Low Rank Quantization-Aware Training for LLMs}},
  author    = {Bondarenko, Yelysei and Del Chiaro, Riccardo and Nagel, Markus},
  booktitle = {ICML 2024 Workshops: ES-FoMo-II},
  year      = {2024},
  url       = {https://mlanthology.org/icmlw/2024/bondarenko2024icmlw-low/}
}