LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models
Abstract
Quantization is an indispensable technique for serving Large Language Models (LLMs) and has recently found its way into LoRA fine-tuning (Dettmers et al., 2023). In this work we focus on the scenario where quantization and LoRA fine-tuning are applied together on a pre-trained model. In such cases it is common to observe a consistent gap in downstream-task performance between full fine-tuning and the quantization-plus-LoRA fine-tuning approach. In response, we propose LoftQ (LoRA-Fine-Tuning-aware Quantization), a novel quantization framework that simultaneously quantizes an LLM and finds a proper low-rank initialization for LoRA fine-tuning. Such an initialization alleviates the discrepancy between the quantized and full-precision model and significantly improves generalization on downstream tasks. We evaluate our method on natural language understanding, question answering, summarization, and natural language generation tasks. Experiments show that our method is highly effective and outperforms existing quantization methods, especially in the challenging 2-bit and 2/4-bit mixed precision regimes. We will release our code.
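To make the idea in the abstract concrete, below is a minimal sketch of a LoRA-fine-tuning-aware initialization: it alternately quantizes the part of a weight matrix not explained by the low-rank adapters and then refits a rank-r correction to the residual via SVD, so that the quantized backbone plus the adapter initialization approximates the full-precision weight. The function names (`simulated_quantize`, `loftq_init`) and the uniform round-to-nearest quantizer are illustrative assumptions, not the paper's actual implementation or quantization scheme.

```python
import numpy as np

def simulated_quantize(weight, bits=2):
    """Placeholder quantizer: uniform round-to-nearest over the weight's
    dynamic range (a stand-in for the low-bit quantizers used in practice)."""
    levels = 2 ** bits - 1
    w_min, w_max = weight.min(), weight.max()
    scale = (w_max - w_min) / levels
    return np.round((weight - w_min) / scale) * scale + w_min

def loftq_init(weight, rank=16, bits=2, num_iters=5):
    """Sketch of the alternating initialization described in the abstract:
    find a quantized matrix Q and rank-r factors A, B so that Q + A @ B
    stays close to the full-precision weight before LoRA fine-tuning."""
    A = np.zeros((weight.shape[0], rank))
    B = np.zeros((rank, weight.shape[1]))
    for _ in range(num_iters):
        # Quantize what the current low-rank correction does not explain.
        Q = simulated_quantize(weight - A @ B, bits=bits)
        # Best rank-r approximation of the remaining residual via SVD.
        U, S, Vt = np.linalg.svd(weight - Q, full_matrices=False)
        A = U[:, :rank] * S[:rank]
        B = Vt[:rank, :]
    return Q, A, B

# Example: initialize a 2-bit backbone plus rank-16 LoRA factors for one layer.
W = np.random.randn(768, 768).astype(np.float32)
Q, A, B = loftq_init(W, rank=16, bits=2)
print("approximation error:", np.linalg.norm(W - Q - A @ B))
```

Under these assumptions, `Q` would be frozen as the quantized backbone while `A` and `B` serve as the LoRA initialization, in contrast to the common practice of initializing the adapters at zero on top of a quantized model.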
Cite
Text
Li et al. "LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models." International Conference on Learning Representations, 2024.
Markdown
[Li et al. "LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models." International Conference on Learning Representations, 2024.](https://mlanthology.org/iclr/2024/li2024iclr-loftq/)
BibTeX
@inproceedings{li2024iclr-loftq,
title = {{LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models}},
author = {Li, Yixiao and Yu, Yifan and Liang, Chen and Karampatziakis, Nikos and He, Pengcheng and Chen, Weizhu and Zhao, Tuo},
booktitle = {International Conference on Learning Representations},
year = {2024},
url = {https://mlanthology.org/iclr/2024/li2024iclr-loftq/}
}