Low-Rank Is Required for Pruning LLMs

Abstract

Post-training pruning without fine-tuning has emerged as an efficient method for compressing large language models for inference, offering a computationally cheaper alternative to compression approaches that require retraining. However, recent studies have revealed that, unlike quantization, pruning consistently degrades model performance as sparsity increases. We demonstrate that this degradation results from pruning's inability to preserve a low-rank structure in the model's weights, which is crucial for maintaining attention sinks. Furthermore, we show that these attention sinks play a key role in enabling the model to segment sequences, an essential mechanism for effective few-shot learning.
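The abstract's central claim is that pruning fails to preserve low-rank structure in the weights. Below is a minimal, illustrative sketch (not code from the paper) of one way to probe that effect: it applies unstructured magnitude pruning to a synthetic low-rank-plus-noise matrix and reports the stable rank, a smooth proxy for rank, at several sparsity levels. The matrix shape, noise level, sparsity levels, and the helper names stable_rank and magnitude_prune are all assumptions made for illustration.

# Minimal sketch (assumptions, not the paper's method): measure how
# unstructured magnitude pruning affects the stable rank of a weight matrix.
import numpy as np

def stable_rank(W: np.ndarray) -> float:
    """Stable rank ||W||_F^2 / ||W||_2^2, a smooth proxy for matrix rank."""
    s = np.linalg.svd(W, compute_uv=False)  # singular values, descending
    return float((s ** 2).sum() / (s[0] ** 2))

def magnitude_prune(W: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out roughly the smallest-magnitude `sparsity` fraction of entries."""
    k = int(sparsity * W.size)
    thresh = np.sort(np.abs(W), axis=None)[k]
    return np.where(np.abs(W) < thresh, 0.0, W)

rng = np.random.default_rng(0)
# Synthetic stand-in for an LLM weight matrix: low-rank signal plus dense noise.
U, V = rng.standard_normal((512, 8)), rng.standard_normal((8, 512))
W = U @ V + 0.1 * rng.standard_normal((512, 512))

for sparsity in (0.0, 0.5, 0.7, 0.9):
    W_pruned = magnitude_prune(W, sparsity)
    print(f"sparsity={sparsity:.1f}  stable_rank={stable_rank(W_pruned):.2f}")

Running this with real model weights in place of the synthetic matrix would be the natural next step; the sketch only shows how the low-rank structure of a single weight matrix can be tracked as sparsity increases.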

Cite

Text

Zhang and Papyan. "Low-Rank Is Required for Pruning LLMs." ICLR 2025 Workshops: SLLM, 2025.

Markdown

[Zhang and Papyan. "Low-Rank Is Required for Pruning LLMs." ICLR 2025 Workshops: SLLM, 2025.](https://mlanthology.org/iclrw/2025/zhang2025iclrw-lowrank/)

BibTeX

@inproceedings{zhang2025iclrw-lowrank,
  title     = {{Low-Rank Is Required for Pruning LLMs}},
  author    = {Zhang, Stephen and Papyan, Vardan},
  booktitle = {ICLR 2025 Workshops: SLLM},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/zhang2025iclrw-lowrank/}
}