Delta-CoMe: Training-Free Delta-Compression with Mixed-Precision for Large Language Models
Abstract
Fine-tuning is a crucial process for adapting large language models (LLMs) to diverse applications. In certain scenarios, such as multi-tenant serving, deploying multiple LLMs becomes necessary to meet complex demands. Recent studies suggest decomposing a fine-tuned LLM into a base model and corresponding delta weights, which are then compressed using low-rank or low-bit approaches to reduce costs. In this work, we observe that existing low-rank and low-bit compression methods can significantly harm the performance of task-specific fine-tuned LLMs (e.g., WizardMath for math problems). Motivated by the long-tail distribution of singular values in the delta weights, we propose a mixed-precision delta quantization approach that employs higher-bit representations for the singular vectors corresponding to larger singular values. We evaluate our approach on various fine-tuned LLMs, including math LLMs, code LLMs, chat LLMs, and even vision-language models (VLMs). Experimental results demonstrate that our approach performs comparably to fully fine-tuned LLMs, surpassing both low-rank and low-bit baselines by a considerable margin. Additionally, we show that our method is compatible with various backbone LLMs, such as Llama-2, Llama-3, and Mistral, highlighting its generalizability.
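To make the idea concrete, below is a minimal Python/NumPy sketch of mixed-precision delta compression. It is an illustration under assumptions, not the paper's implementation: the delta weight is factorized with SVD, and successive groups of singular vectors are quantized at decreasing bit-widths as their singular values shrink. The `quantize` helper (plain uniform fake-quantization) and the `groups` schedule are hypothetical choices for illustration only.

```python
import numpy as np

def quantize(x: np.ndarray, bits: int) -> np.ndarray:
    """Uniform symmetric fake-quantization of a matrix to `bits` bits."""
    if bits >= 16:
        # Treat 16 bits as effectively lossless half precision.
        return x.astype(np.float16).astype(np.float64)
    scale = np.max(np.abs(x)) / (2 ** (bits - 1) - 1)
    if scale == 0:
        return np.zeros_like(x)
    return np.round(x / scale) * scale

def compress_delta(delta: np.ndarray,
                   groups=((2, 16), (16, 8), (64, 3))) -> np.ndarray:
    """Approximate a delta weight matrix with mixed-precision SVD factors.

    `groups` is a (num_singular_vectors, bits) schedule ordered from the
    largest singular values to smaller ones; the exact numbers here are
    illustrative, not the paper's configuration.
    """
    U, S, Vt = np.linalg.svd(delta, full_matrices=False)
    approx = np.zeros_like(delta)
    start = 0
    for r, bits in groups:
        end = min(start + r, len(S))
        # Fold singular values into the left factor, then quantize both
        # factors of this group at the assigned bit-width.
        Us = U[:, start:end] * S[start:end]
        approx += quantize(Us, bits) @ quantize(Vt[start:end], bits)
        start = end
        if start == len(S):
            break
    return approx

# Usage: a fine-tuned weight is served as base + compressed delta.
base = np.random.randn(256, 256)
finetuned = base + 0.01 * np.random.randn(256, 256)
delta_hat = compress_delta(finetuned - base)
print(np.linalg.norm(finetuned - (base + delta_hat)))
```

Because the singular values of delta weights follow a long-tail distribution, the few leading singular vectors carry most of the energy; spending more bits on them and very few on the tail keeps reconstruction error low at a high overall compression ratio.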
Cite
Text
Ping et al. "Delta-CoMe: Training-Free Delta-Compression with Mixed-Precision for Large Language Models." Neural Information Processing Systems, 2024. doi:10.52202/079017-0978
Markdown
[Ping et al. "Delta-CoMe: Training-Free Delta-Compression with Mixed-Precision for Large Language Models." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/ping2024neurips-deltacome/) doi:10.52202/079017-0978
BibTeX
@inproceedings{ping2024neurips-deltacome,
  title     = {{Delta-CoMe: Training-Free Delta-Compression with Mixed-Precision for Large Language Models}},
  author    = {Ping, Bowen and Wang, Shuo and Wang, Hanqing and Han, Xu and Xu, Yuzhuang and Yan, Yukun and Chen, Yun and Chang, Baobao and Liu, Zhiyuan and Sun, Maosong},
  booktitle = {Neural Information Processing Systems},
  year      = {2024},
  doi       = {10.52202/079017-0978},
  url       = {https://mlanthology.org/neurips/2024/ping2024neurips-deltacome/}
}