QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models
Abstract
Recent years have witnessed a rapid development of large language models (LLMs). Despite their strong ability in many language-understanding tasks, the heavy computational burden largely restricts the application of LLMs, especially when one needs to deploy them onto edge devices. In this paper, we propose a quantization-aware low-rank adaptation (QA-LoRA) algorithm. The motivation lies in the imbalanced degrees of freedom of quantization and adaptation, and the solution is to use group-wise operators which increase the degrees of freedom of quantization while decreasing those of adaptation. QA-LoRA is easily implemented with a few lines of code, and it equips the original LoRA with two-fold abilities: (i) during fine-tuning, the LLM's weights are quantized (e.g., into INT4) to reduce time and memory usage; (ii) after fine-tuning, the LLM and auxiliary weights are naturally integrated into a quantized model without loss of accuracy. We apply QA-LoRA to the LLaMA and LLaMA2 model families and validate its effectiveness on different fine-tuning datasets and downstream scenarios. The code is made available at https://github.com/yuhuixu1993/qa-lora.
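To make the group-wise idea concrete, the following is a minimal, hypothetical PyTorch sketch of a linear layer with group-wise quantized frozen weights plus a low-rank branch that acts on group-pooled inputs. The class name, shapes, group size, and the fake-quantization scheme are illustrative assumptions for exposition, not the authors' reference implementation (see the repository above for that).

```python
# A minimal sketch (not the official QA-LoRA code): group-wise quantized frozen
# weights plus a low-rank adapter whose input is average-pooled per group, so
# the adapter's degrees of freedom match the per-group quantization parameters.
import torch
import torch.nn as nn


class QALoRALinearSketch(nn.Module):
    def __init__(self, d_in, d_out, rank=16, group_size=32, n_bits=4):
        super().__init__()
        assert d_in % group_size == 0
        self.group_size = group_size
        n_groups = d_in // group_size

        # Frozen weight, quantized group-wise along the input dimension
        # (asymmetric min-max quantization, kept simple for clarity).
        w = torch.randn(d_out, d_in)
        w_g = w.view(d_out, n_groups, group_size)
        w_min = w_g.amin(dim=-1, keepdim=True)
        w_max = w_g.amax(dim=-1, keepdim=True)
        scale = (w_max - w_min).clamp(min=1e-8) / (2 ** n_bits - 1)
        q = torch.round((w_g - w_min) / scale)      # INT4 codes in [0, 15]
        self.register_buffer("q", q)                # frozen quantized codes
        self.register_buffer("scale", scale)        # per-group scale
        self.register_buffer("zero", w_min)         # per-group zero point

        # Trainable low-rank adapter acting on one pooled value per input
        # group, which is what allows merging into the zero points later.
        self.lora_a = nn.Parameter(torch.randn(rank, n_groups) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(d_out, rank))

    def forward(self, x):
        # Dequantize the frozen weight on the fly.
        w = (self.q * self.scale + self.zero).reshape(self.q.shape[0], -1)
        base = x @ w.t()
        # Average-pool inputs group-wise before the low-rank branch.
        pooled = x.view(*x.shape[:-1], -1, self.group_size).mean(dim=-1)
        return base + pooled @ self.lora_a.t() @ self.lora_b.t()


if __name__ == "__main__":
    layer = QALoRALinearSketch(d_in=128, d_out=64)
    print(layer(torch.randn(2, 128)).shape)  # torch.Size([2, 64])
```

Because the adapter only sees one pooled value per input group, its contribution is constant within each group of input channels and can, after fine-tuning, be folded into the per-group zero points of the quantized weight, which is why the merged model stays in integer form without accuracy loss.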
Cite
Text
Xu et al. "QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models." International Conference on Learning Representations, 2024.

Markdown
[Xu et al. "QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models." International Conference on Learning Representations, 2024.](https://mlanthology.org/iclr/2024/xu2024iclr-qalora/)

BibTeX
@inproceedings{xu2024iclr-qalora,
  title = {{QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models}},
  author = {Xu, Yuhui and Xie, Lingxi and Gu, Xiaotao and Chen, Xin and Chang, Heng and Zhang, Hengheng and Chen, Zhengsu and Zhang, Xiaopeng and Tian, Qi},
  booktitle = {International Conference on Learning Representations},
  year = {2024},
  url = {https://mlanthology.org/iclr/2024/xu2024iclr-qalora/}
}