WINA: Weight Informed Neuron Activation for Accelerating Large Language Model Inference

Abstract

The ever-increasing computational demands of large language models (LLMs) make efficient inference a central challenge. While recent advances leverage specialized architectures or selective activation, they typically require (re)training or architectural modifications, limiting their broad applicability. Training-free sparse activation, in contrast, offers a plug-and-play pathway to efficiency; however, existing methods often rely solely on hidden state magnitudes, leading to significant approximation error and performance degradation. To address this, we introduce WINA (Weight-Informed Neuron Activation): a simple framework for training-free sparse activation that incorporates both hidden state magnitudes and weight matrix structure. By also leveraging the ℓ2-norm of the model’s weight matrices, WINA yields a principled sparsification strategy with provably optimal approximation error bounds, offering tighter theoretical guarantees than prior state-of-the-art approaches. Empirically, WINA outperforms many previous training-free methods across diverse LLM architectures and datasets: it matches or exceeds their accuracy at comparable sparsity levels and sustains performance better at more extreme sparsity levels. Together, these results position WINA as a practical, theoretically grounded, and broadly deployable solution for efficient inference. Our source code is available at https://github.com/microsoft/wina.
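
To make the contrast between magnitude-only gating and weight-informed gating concrete, here is a minimal NumPy sketch. The abstract does not spell out the exact scoring rule, so this assumes one plausible reading: each input neuron is scored by its hidden state magnitude multiplied by the ℓ2-norm of the corresponding column of the weight matrix, and only the top-k neurons are kept. The function names (`topk_mask`, `magnitude_gate`, `weight_informed_gate`, `sparse_linear`) are hypothetical and are not taken from the released code.

```python
import numpy as np


def topk_mask(scores: np.ndarray, k: int) -> np.ndarray:
    """Boolean mask that keeps the k largest entries of a 1-D score vector."""
    mask = np.zeros_like(scores, dtype=bool)
    if k > 0:
        keep = np.argpartition(scores, -k)[-k:]
        mask[keep] = True
    return mask


def magnitude_gate(x: np.ndarray, k: int) -> np.ndarray:
    """Baseline: keep the k hidden-state entries with the largest |x_i|."""
    return topk_mask(np.abs(x), k)


def weight_informed_gate(x: np.ndarray, W: np.ndarray, k: int) -> np.ndarray:
    """Assumed weight-informed rule: score_i = |x_i| * ||W[:, i]||_2."""
    col_norms = np.linalg.norm(W, axis=0)  # one l2-norm per input neuron
    return topk_mask(np.abs(x) * col_norms, k)


def sparse_linear(x: np.ndarray, W: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Compute W @ x using only the columns selected by the mask."""
    return W[:, mask] @ x[mask]


# Toy case: the largest hidden-state entry feeds a near-zero weight column.
x = np.array([1.0, 0.9, 0.1])
W = np.diag([0.01, 1.0, 10.0])
dense = W @ x

err_magnitude = np.linalg.norm(dense - sparse_linear(x, W, magnitude_gate(x, 1)))
err_weighted = np.linalg.norm(dense - sparse_linear(x, W, weight_informed_gate(x, W, 1)))
```

In this toy case the magnitude-only gate keeps the first neuron, whose weight column is nearly zero, while the weight-informed gate keeps the third, which contributes most to the output. The resulting approximation error is smaller for the weight-informed choice, which is the intuition behind combining both signals.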

Cite

Text

Chen et al. "WINA: Weight Informed Neuron Activation for Accelerating Large Language Model Inference." International Conference on Learning Representations, 2026.

Markdown

[Chen et al. "WINA: Weight Informed Neuron Activation for Accelerating Large Language Model Inference." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/chen2026iclr-wina/)

BibTeX

@inproceedings{chen2026iclr-wina,
  title     = {{WINA: Weight Informed Neuron Activation for Accelerating Large Language Model Inference}},
  author    = {Chen, Sihan and Zhao, Dan and Ko, Jongwoo and Banbury, Colby and Zhuang, Huiping and Liang, Luming and Cameron, Pashmina and Chen, Tianyi},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/chen2026iclr-wina/}
}