Training-Free Activation Sparsity in Large Language Models

Abstract

Activation sparsity can enable practical inference speedups in large language models (LLMs) by reducing the compute and memory movement required for matrix multiplications during the forward pass. However, existing methods face limitations that inhibit widespread adoption. Some approaches are tailored towards older models with ReLU-based sparsity, while others require extensive continued pre-training on up to hundreds of billions of tokens. This paper describes TEAL (**T**raining-Fre**e** **A**ctivation Sparsity in **L**LMs), a simple training-free method that applies magnitude-based activation sparsity to hidden states throughout the entire model. TEAL achieves 40-50% model-wide sparsity with minimal performance degradation across the Llama-2, Llama-3, and Mistral families, at sizes ranging from 7B to 70B. We improve existing sparse kernels and demonstrate wall-clock decoding speed-ups of up to 1.53× and 1.8× at 40% and 50% model-wide sparsity, respectively. TEAL is compatible with weight quantization, enabling further efficiency gains.
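To make the core idea concrete, below is a minimal PyTorch sketch of magnitude-based activation sparsification: entries of a hidden-state tensor with small magnitude are zeroed before the subsequent matrix multiplication. This is an illustrative assumption, not the paper's implementation; the function name, the on-the-fly quantile threshold, and the example shapes are hypothetical, and the actual speedups rely on specialized sparse kernels rather than a dense matmul.

```python
import torch

def magnitude_sparsify(hidden: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out the lowest-magnitude fraction of entries in a hidden-state tensor.

    `sparsity` is the target fraction of entries to drop (e.g. 0.5 for 50%).
    """
    # Threshold below which `sparsity` of the entries fall (computed here on the
    # fly for illustration; a real system would calibrate thresholds offline).
    threshold = torch.quantile(hidden.abs().float(), sparsity)
    return torch.where(hidden.abs() < threshold, torch.zeros_like(hidden), hidden)

# Hypothetical usage: sparsify hidden states before a projection.
x = torch.randn(4, 4096)       # batch of hidden states (illustrative shape)
W = torch.randn(4096, 11008)   # projection weight (illustrative shape)
x_sparse = magnitude_sparsify(x, sparsity=0.5)
y = x_sparse @ W               # a sparse kernel would skip the zeroed entries
```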

Cite

Text

Liu et al. "Training-Free Activation Sparsity in Large Language Models." International Conference on Learning Representations, 2025.

Markdown

[Liu et al. "Training-Free Activation Sparsity in Large Language Models." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/liu2025iclr-trainingfree/)

BibTeX

@inproceedings{liu2025iclr-trainingfree,
  title     = {{Training-Free Activation Sparsity in Large Language Models}},
  author    = {Liu, James and Ponnusamy, Pragaash and Cai, Tianle and Guo, Han and Kim, Yoon and Athiwaratkun, Ben},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/liu2025iclr-trainingfree/}
}