Prompt-Based Depth Pruning of Large Language Models

Abstract

Depth pruning aims to reduce the inference cost of a large language model without any hardware-specific complications, by simply removing several less important transformer blocks. However, our empirical findings suggest that the importance of a transformer block may be highly task-dependent: a block that is crucial for one task can be removed without degrading accuracy on another. Based on this observation, we develop a dynamic depth pruning algorithm, coined PuDDing (Prompt-routed Dynamic Depth Pruning), which determines which blocks to omit from the model based on the input prompt. PuDDing operates by training a lightweight router to predict the best omission set among a set of options, where this option set has also been constructed in a data-driven manner. Empirical results on commonsense reasoning benchmarks demonstrate that PuDDing effectively accelerates the inference of language models and achieves better on-task performance than static depth pruning baselines.
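
To make the routing mechanism concrete, below is a minimal Python sketch (not the authors' implementation) of how a lightweight prompt router could select one of several precomputed omission sets and skip the corresponding transformer blocks at inference time. The names PromptRouter, forward_with_depth_pruning, and the example omission_sets are illustrative assumptions; the actual router architecture, prompt features, and candidate sets in the paper may differ.

import torch
import torch.nn as nn

class PromptRouter(nn.Module):
    """Lightweight classifier mapping a prompt embedding to one of K
    candidate omission sets (indices of transformer blocks to skip)."""
    def __init__(self, hidden_dim: int, num_omission_sets: int):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim // 4),
            nn.ReLU(),
            nn.Linear(hidden_dim // 4, num_omission_sets),
        )

    def forward(self, prompt_embedding: torch.Tensor) -> torch.Tensor:
        # Return the index of the predicted best omission set.
        logits = self.classifier(prompt_embedding)
        return logits.argmax(dim=-1)

def forward_with_depth_pruning(blocks, hidden_states, omission_set):
    """Run the transformer stack, omitting the blocks chosen by the router."""
    for idx, block in enumerate(blocks):
        if idx in omission_set:
            continue  # depth pruning: skip this block entirely
        hidden_states = block(hidden_states)
    return hidden_states

# Usage sketch: candidate omission sets are assumed to be built offline
# in a data-driven manner; the router picks one per input prompt.
omission_sets = [{10, 11, 12}, {20, 21}, {5, 18, 25}]  # hypothetical candidates
router = PromptRouter(hidden_dim=4096, num_omission_sets=len(omission_sets))
prompt_embedding = torch.randn(1, 4096)  # e.g., a pooled prompt representation
choice = router(prompt_embedding).item()
blocks_to_skip = omission_sets[choice]
# hidden_states = forward_with_depth_pruning(model_blocks, hidden_states, blocks_to_skip)

Because the omission set is fixed per prompt, the pruned forward pass requires no hardware-specific kernels; the selected blocks are simply not executed.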

Cite

Text

Wee et al. "Prompt-Based Depth Pruning of Large Language Models." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown

[Wee et al. "Prompt-Based Depth Pruning of Large Language Models." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/wee2025icml-promptbased/)

BibTeX

@inproceedings{wee2025icml-promptbased,
  title     = {{Prompt-Based Depth Pruning of Large Language Models}},
  author    = {Wee, Juyun and Park, Minjae and Lee, Jaeho},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
  pages     = {65936--65948},
  volume    = {267},
  url       = {https://mlanthology.org/icml/2025/wee2025icml-promptbased/}
}