Accelerating Transformer Inference and Training with 2:4 Activation Sparsity
Abstract
In this paper, we demonstrate how to apply 2:4 sparsity, a hardware-accelerated GPU sparsity pattern, to activations to accelerate large language model training and inference. Crucially, we exploit the intrinsic sparsity of Squared-ReLU activations to provide this acceleration with no accuracy loss. Our approach achieves up to 1.3x faster Feed-Forward Networks (FFNs) in both the forward and backward passes. We also discuss the benefits of combining 2:4 sparsity with FP8 quantization to maximize efficiency gains. This work highlights the potential for sparsity to play a key role in accelerating large language model training and inference.
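The core idea can be illustrated with a minimal PyTorch sketch (not the authors' implementation): Squared-ReLU produces activations that are already mostly exact zeros, so enforcing the hardware 2:4 pattern, which keeps at most 2 nonzero values in every contiguous group of 4, discards little information. The helper names below (`squared_relu`, `sparsify_2_4`) are hypothetical, and real speedups come from the GPU's compressed sparse kernels rather than the dense masking shown here.

```python
import torch

def squared_relu(x: torch.Tensor) -> torch.Tensor:
    # Squared-ReLU activation: relu(x) ** 2. Produces many exact zeros,
    # which is the intrinsic sparsity the paper exploits.
    return torch.relu(x) ** 2

def sparsify_2_4(x: torch.Tensor) -> torch.Tensor:
    # Enforce the 2:4 pattern along the last dimension (assumed divisible
    # by 4): within each contiguous group of 4 values, keep the 2 with the
    # largest magnitude and zero out the rest. Hypothetical reference
    # implementation; hardware kernels use a compressed representation.
    orig_shape = x.shape
    groups = x.reshape(-1, 4)
    # Indices of the two smallest-magnitude entries in each group of 4.
    _, drop_idx = groups.abs().topk(2, dim=-1, largest=False)
    mask = torch.ones_like(groups, dtype=torch.bool)
    mask.scatter_(-1, drop_idx, False)
    return (groups * mask).reshape(orig_shape)

# Example: FFN activations after Squared-ReLU are already mostly zero,
# so forcing the 2:4 pattern changes few values.
x = torch.randn(8, 16)
act = squared_relu(x)
act_24 = sparsify_2_4(act)
print((act_24 == 0).float().mean())  # at least 50% zeros by construction
```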
Cite
Text
Haziza et al. "Accelerating Transformer Inference and Training with 2:4 Activation Sparsity." ICLR 2025 Workshops: SLLM, 2025.
Markdown
[Haziza et al. "Accelerating Transformer Inference and Training with 2:4 Activation Sparsity." ICLR 2025 Workshops: SLLM, 2025.](https://mlanthology.org/iclrw/2025/haziza2025iclrw-accelerating/)
BibTeX
@inproceedings{haziza2025iclrw-accelerating,
  title     = {{Accelerating Transformer Inference and Training with 2:4 Activation Sparsity}},
  author    = {Haziza, Daniel and Chou, Timothy and Choudhary, Dhruv and Cai, Jesse and Wehrstedt, Luca and Massa, Francisco and Yu, Jiecao and Jeong, Geonhwa and Rao, Supriya and Labatut, Patrick},
  booktitle = {ICLR 2025 Workshops: SLLM},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/haziza2025iclrw-accelerating/}
}