LaRoSA: Enhancing LLM Efficiency via Layerwise Rotated Sparse Activation

ICML 2025, pp. 39968–39986

Abstract

Activation sparsity can reduce the computational overhead and memory transfers during the forward pass of Large Language Model (LLM) inference. Existing methods face limitations: they either demand time-consuming recovery training, which hinders real-world adoption, or rely on empirical magnitude-based pruning, which causes fluctuating sparsity and unstable inference speed-up. This paper introduces LaRoSA (Layerwise Rotated Sparse Activation), a novel method for activation sparsification that improves LLM efficiency without requiring additional training or magnitude-based pruning. We leverage layerwise orthogonal rotations to transform input activations into rotated forms that are more amenable to sparsification. By applying Top-K selection to the rotated activations, we achieve consistent model-level sparsity and reliable wall-clock time speed-up. LaRoSA is effective across LLMs of various sizes and types, demonstrating minimal performance degradation and robust inference acceleration. Specifically, for LLaMA2-7B at 40% sparsity, LaRoSA achieves a mere 0.17 perplexity gap with a consistent 1.30$\times$ wall-clock time speed-up, and reduces the accuracy gap in zero-shot tasks compared to the dense model to just 0.54%, while surpassing TEAL by 1.77% and CATS by 17.14%.
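The mechanism described in the abstract (rotate each layer's input activations with an orthogonal matrix, then keep only the Top-K entries by magnitude) can be sketched in a few lines of PyTorch. The sketch below is illustrative only, not the authors' implementation: `topk_sparsify` and `rotated_sparse_linear` are hypothetical names, the rotation here is a random orthogonal matrix rather than the layerwise rotations the paper derives, and a real deployment would fold the rotation into the adjacent weights offline so it adds no runtime cost.

```python
# Minimal sketch (assumed names, not the paper's code): Top-K activation
# sparsification in a rotated basis, in the spirit of LaRoSA.
import torch

def topk_sparsify(x: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero all but the largest-magnitude (1 - sparsity) fraction of
    entries per token, guaranteeing a fixed sparsity level."""
    k = max(1, int(x.shape[-1] * (1.0 - sparsity)))
    idx = x.abs().topk(k, dim=-1).indices
    mask = torch.zeros_like(x, dtype=torch.bool).scatter_(-1, idx, True)
    return x * mask

def rotated_sparse_linear(x, W, Q, sparsity=0.4):
    """Compute TopK(x Q) @ (Q^T W). Since Q Q^T = I for orthogonal Q,
    this equals x @ W exactly when sparsity = 0."""
    x_rot = x @ Q                               # rotate activations
    x_sparse = topk_sparsify(x_rot, sparsity)   # fixed per-token Top-K
    return x_sparse @ (Q.T @ W)                 # rotation folded into W

# Toy usage with a random orthogonal Q from a QR decomposition; the paper
# instead constructs layerwise rotations tailored to the activations.
d_in, d_out = 64, 128
Q, _ = torch.linalg.qr(torch.randn(d_in, d_in))
x = torch.randn(2, d_in)
W = torch.randn(d_in, d_out)
y = rotated_sparse_linear(x, W, Q, sparsity=0.4)
```

Because the mask keeps a fixed number of entries per token rather than thresholding magnitudes, the sparsity level (and hence the attainable speed-up) is constant across inputs, which is the property the abstract contrasts with magnitude-based pruning.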

Cite

Text

Liu et al. "LaRoSA: Enhancing LLM Efficiency via Layerwise Rotated Sparse Activation." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown

[Liu et al. "LaRoSA: Enhancing LLM Efficiency via Layerwise Rotated Sparse Activation." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/liu2025icml-la/)

BibTeX

@inproceedings{liu2025icml-la,
  title     = {{LaRoSA: Enhancing LLM Efficiency via Layerwise Rotated Sparse Activation}},
  author    = {Liu, Kai and Xu, Bowen and Wu, Shaoyu and Chen, Xin and Zhou, Hao and Tao, Yongliang and Hu, Lulu},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
  pages     = {39968--39986},
  volume    = {267},
  url       = {https://mlanthology.org/icml/2025/liu2025icml-la/}
}