Probe Pruning: Accelerating LLMs Through Dynamic Pruning via Model-Probing

Abstract

We introduce Probe Pruning (PP), a novel framework for online, dynamic, structured pruning of Large Language Models (LLMs) applied in a batch-wise manner. PP leverages the insight that not all samples and tokens contribute equally to the model's output, and probing a small portion of each batch effectively identifies crucial weights, enabling tailored dynamic pruning for different batches. It comprises three main stages: probing, history-informed pruning, and full inference. In the probing stage, PP selects a small yet crucial set of hidden states, based on residual importance, to run a few model layers ahead. During the history-informed pruning stage, PP strategically integrates the probing states with historical states. Subsequently, it structurally prunes weights based on the integrated states and the PP importance score, a metric developed specifically to assess the importance of each weight channel in maintaining performance. In the final stage, full inference is conducted on the remaining weights. A major advantage of PP is its compatibility with existing models, as it operates without requiring additional neural network modules or fine-tuning. Comprehensive evaluations of PP on LLaMA-2/3 and OPT models reveal that even minimal probing—using just 1.5% of FLOPs—can substantially enhance the efficiency of structured pruning of LLMs. For instance, when evaluated on LLaMA-2-7B with WikiText2, PP achieves a 2.56 times lower ratio of performance degradation per unit of latency reduction compared to the state-of-the-art method at a 40% pruning ratio.
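To make the three-stage flow described above concrete, below is a minimal, illustrative sketch in PyTorch of batch-wise probing, history-informed channel scoring, and full inference on the pruned weights. The function names, the `probe_ratio`/`alpha` parameters, the moving-average fusion of probe and historical statistics, and the Wanda-style activation-times-weight proxy used for the channel score are assumptions made for illustration; they are not the paper's exact PP importance score or implementation.

```python
# Sketch of the probing -> history-informed pruning -> full-inference pipeline.
# All shapes, hyperparameters, and the importance heuristic are illustrative
# assumptions, not the paper's exact formulation.
import torch

def select_probe(hidden, probe_ratio=0.05):
    """Stage 1 (probing): keep the tokens with the largest residual (hidden-state) norm."""
    # hidden: (batch, seq_len, d_model)
    norms = hidden.norm(dim=-1)                                  # (batch, seq_len)
    k = max(1, int(probe_ratio * hidden.shape[1]))
    idx = norms.topk(k, dim=1).indices                           # (batch, k)
    return torch.gather(hidden, 1, idx.unsqueeze(-1).expand(-1, -1, hidden.shape[-1]))

def channel_importance(probe_states, history, weight, alpha=0.9):
    """Stage 2a: fuse probe statistics with running history, then score weight channels."""
    # probe_states: (batch, k, d_in); weight: (d_out, d_in)
    probe_stat = probe_states.abs().mean(dim=(0, 1))             # per-input-channel activation scale
    fused = alpha * history + (1 - alpha) * probe_stat           # history-informed statistics (assumed EMA)
    score = fused * weight.abs().sum(dim=0)                      # assumed activation-times-weight proxy
    return score, fused

def prune_channels(weight, score, prune_ratio=0.4):
    """Stage 2b: structurally zero out the lowest-scoring input channels for this batch."""
    n_prune = int(prune_ratio * weight.shape[1])
    drop = score.argsort()[:n_prune]
    pruned = weight.clone()
    pruned[:, drop] = 0.0
    return pruned

# Toy usage on one linear layer of a transformer block.
torch.manual_seed(0)
hidden = torch.randn(4, 128, 64)           # batch of hidden states entering the layer
weight = torch.randn(256, 64)              # the layer's weight matrix
history = torch.zeros(64)                  # running activation statistics across batches

probe = select_probe(hidden)                                     # stage 1: probing
score, history = channel_importance(probe, history, weight)      # stage 2: history-informed scoring
pruned_weight = prune_channels(weight, score)                    # stage 2: structured pruning
output = hidden @ pruned_weight.t()                              # stage 3: full inference on remaining weights
```

In this sketch the probe only touches a small fraction of tokens per batch, which mirrors why the probing overhead can stay small (the abstract reports about 1.5% of FLOPs) while still adapting the pruning decision to each batch.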

Cite

Text

Le et al. "Probe Pruning: Accelerating LLMs Through Dynamic Pruning via Model-Probing." International Conference on Learning Representations, 2025.

Markdown

[Le et al. "Probe Pruning: Accelerating LLMs Through Dynamic Pruning via Model-Probing." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/le2025iclr-probe/)

BibTeX

@inproceedings{le2025iclr-probe,
  title     = {{Probe Pruning: Accelerating LLMs Through Dynamic Pruning via Model-Probing}},
  author    = {Le, Qi and Diao, Enmao and Wang, Ziyan and Wang, Xinran and Ding, Jie and Yang, Li and Anwar, Ali},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/le2025iclr-probe/}
}