Discovering Sparsity Allocation for Layer-Wise Pruning of Large Language Models

Abstract

In this paper, we present DSA, the first automated framework for discovering sparsity allocation schemes for layer-wise pruning in Large Language Models (LLMs). LLMs have become increasingly powerful, but their large parameter counts make them computationally expensive. Existing pruning methods for compressing LLMs primarily focus on evaluating redundancies and removing element-wise weights. However, these methods fail to allocate adaptive layer-wise sparsities, leading to performance degradation on challenging tasks. We observe that per-layer importance statistics can serve as allocation indicators, but their effectiveness depends on the allocation function that maps them to sparsities across layers. To address this issue, we develop an expression discovery framework to explore potential allocation strategies. Our allocation functions involve two steps: reducing element-wise metrics to per-layer importance scores, and mapping layer importance scores to sparsity ratios. To search for the most effective allocation function, we construct a search space consisting of pre-process, reduction, transform, and post-process operations. We leverage an evolutionary algorithm to perform crossover and mutation on superior candidates within the population, guided by performance evaluation. Finally, we seamlessly integrate our discovered functions into various uniform methods, resulting in significant performance improvements. We conduct extensive experiments on multiple challenging tasks such as arithmetic, knowledge reasoning, and multimodal benchmarks spanning GSM8K, MMLU, SQA, and VQA, demonstrating that our DSA method achieves significant performance gains on the LLaMA-1|2|3, Mistral, and OPT models. Notably, the LLaMA-1|2|3 models pruned by our DSA achieve 4.73%|6.18%|10.65% gains over state-of-the-art techniques (e.g., Wanda and SparseGPT).
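The abstract describes a two-step allocation function: element-wise pruning metrics are first reduced to a per-layer importance score, which is then transformed into a per-layer sparsity ratio subject to a global sparsity budget. Below is a minimal PyTorch sketch of such a pipeline, assuming an element-wise metric such as Wanda's |W|·||X||; the function name `allocate_sparsity`, the specific log/mean/softmax operations, and the clamping bounds are illustrative placeholders from the paper's pre-process/reduction/transform/post-process search space, not the expression actually discovered by DSA.

```python
# Sketch of a layer-wise sparsity allocation function in the spirit of DSA.
# The concrete operations chosen here are one hypothetical candidate from a
# pre-process / reduction / transform / post-process search space.
import torch


def allocate_sparsity(metrics, target_sparsity=0.5, temperature=1.0,
                      min_ratio=0.2, max_ratio=0.8):
    """Map element-wise importance metrics to per-layer sparsity ratios.

    metrics: list of tensors, one per layer, holding element-wise importance
             scores (e.g. |W| * ||X|| as in Wanda).
    Returns a list of per-layer sparsity ratios whose mean matches
    target_sparsity (before clamping).
    """
    # Step 1: reduce element-wise metrics to a per-layer importance score
    # (pre-process: log1p of magnitudes; reduction: mean over all elements).
    scores = torch.stack([torch.log1p(m.abs()).mean() for m in metrics])

    # Step 2: transform importance scores into sparsity ratios.
    # More important layers should be pruned less, so scores are negated
    # before a temperature-scaled softmax.
    weights = torch.softmax(-scores / temperature, dim=0)

    # Post-process: rescale so the mean sparsity equals the global target,
    # then clamp each layer to a sensible range.
    ratios = (weights * len(metrics) * target_sparsity).clamp(min_ratio, max_ratio)
    return ratios.tolist()


if __name__ == "__main__":
    # Toy example: three layers with random "importance" metrics.
    torch.manual_seed(0)
    fake_metrics = [torch.rand(256, 256),
                    torch.rand(256, 256) * 2.0,
                    torch.rand(256, 256) * 0.5]
    print(allocate_sparsity(fake_metrics, target_sparsity=0.5))
```

In DSA the choice of each of these four stages is searched with an evolutionary algorithm rather than fixed by hand; the resulting ratios would then be passed to a uniform pruning backend such as Wanda or SparseGPT in place of a single global sparsity.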

Cite

Text

Li et al. "Discovering Sparsity Allocation for Layer-Wise Pruning of Large Language Models." Neural Information Processing Systems, 2024. doi:10.52202/079017-4487

Markdown

[Li et al. "Discovering Sparsity Allocation for Layer-Wise Pruning of Large Language Models." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/li2024neurips-discovering/) doi:10.52202/079017-4487

BibTeX

@inproceedings{li2024neurips-discovering,
  title     = {{Discovering Sparsity Allocation for Layer-Wise Pruning of Large Language Models}},
  author    = {Li, Lujun and Dong, Peijie and Tang, Zhenheng and Liu, Xiang and Wang, Qiang and Luo, Wenhan and Xue, Wei and Liu, Qifeng and Chu, Xiaowen and Guo, Yike},
  booktitle = {Neural Information Processing Systems},
  year      = {2024},
  doi       = {10.52202/079017-4487},
  url       = {https://mlanthology.org/neurips/2024/li2024neurips-discovering/}
}