SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot

Abstract

We show for the first time that large-scale generative pretrained transformer (GPT) family models can be pruned to at least 50% sparsity in one-shot, without any retraining, at minimal loss of accuracy. This is achieved via a new pruning method called SparseGPT, specifically designed to work efficiently and accurately on massive GPT-family models. We can execute SparseGPT on the largest available open-source models, OPT-175B and BLOOM-176B, in under 4.5 hours, and can reach 60% unstructured sparsity with negligible increase in perplexity: remarkably, more than 100 billion weights from these models can be ignored at inference time. SparseGPT generalizes to semi-structured (2:4 and 4:8) patterns, and is compatible with weight quantization approaches. The code is available at: https://github.com/IST-DASLab/sparsegpt.
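To make the sparsity terminology concrete, the sketch below applies a simple magnitude-based 2:4 mask to a weight matrix, i.e. it keeps the 2 largest-magnitude weights in every contiguous group of 4, yielding exactly 50% sparsity in the hardware-friendly semi-structured layout mentioned in the abstract. This is only an illustration of the pattern under that assumption; it is not the SparseGPT algorithm itself, which additionally uses approximate second-order information and weight updates to preserve accuracy. The helper name `apply_2_to_4_mask` is hypothetical and not taken from the linked repository.

```python
# Illustrative sketch: enforce a 2:4 semi-structured sparsity pattern by
# magnitude. NOT the SparseGPT pruning method, which uses Hessian-based
# updates rather than plain magnitude selection.
import torch


def apply_2_to_4_mask(weight: torch.Tensor) -> torch.Tensor:
    """Zero out the 2 smallest-magnitude weights in each group of 4 along the input dim."""
    out_features, in_features = weight.shape
    assert in_features % 4 == 0, "input dimension must be divisible by 4"
    groups = weight.reshape(out_features, in_features // 4, 4)
    # Indices of the 2 largest-magnitude entries per group of 4.
    keep = groups.abs().topk(k=2, dim=-1).indices
    mask = torch.zeros_like(groups, dtype=torch.bool)
    mask.scatter_(-1, keep, True)
    return (groups * mask).reshape(out_features, in_features)


if __name__ == "__main__":
    w = torch.randn(8, 16)
    w_sparse = apply_2_to_4_mask(w)
    # Exactly half of the weights are zero, arranged 2-out-of-4 per group.
    print(f"sparsity: {(w_sparse == 0).float().mean():.2f}")
```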

Cite

Text

Frantar and Alistarh. "SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot." International Conference on Machine Learning, 2023.

Markdown

[Frantar and Alistarh. "SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot." International Conference on Machine Learning, 2023.](https://mlanthology.org/icml/2023/frantar2023icml-sparsegpt/)

BibTeX

@inproceedings{frantar2023icml-sparsegpt,
  title     = {{SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot}},
  author    = {Frantar, Elias and Alistarh, Dan},
  booktitle = {International Conference on Machine Learning},
  year      = {2023},
  pages     = {10323--10337},
  volume    = {202},
  url       = {https://mlanthology.org/icml/2023/frantar2023icml-sparsegpt/}
}