Sparse and Wide Linear RNNs Are at the Efficiency-Performance Pareto Front
Abstract
Linear recurrent neural networks enable powerful long-range sequence modeling with constant memory usage and time per token during inference. These architectures hold promise for streaming applications at the edge, but deployment in resource-constrained environments requires hardware-aware optimizations to minimize latency and energy consumption. In this paper, we investigate the effectiveness of unstructured sparsity, in both weights and activations, at reducing the computational demand of linear RNNs, as well as its combination with quantization. We find that highly sparse linear RNNs consistently achieve better efficiency-performance trade-offs than dense baselines, with $2\times$ less compute and $36\%$ less memory at iso-accuracy, and that quantizing a sparse-and-wide network incurs less performance degradation than quantizing a dense one. When quantized to fixed-point arithmetic and deployed on the Intel Loihi 2 neuromorphic chip, sparse models demonstrate $42\times$ lower latency and $149\times$ lower energy consumption compared to an iso-accuracy dense model on an edge GPU, providing hardware validation of the theoretical gains of unstructured sparsity.
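To make the two ingredients of the abstract concrete, below is a minimal, illustrative sketch of a linear RNN with unstructured weight sparsity and thresholded activation sparsity. The diagonal recurrence, the `magnitude_prune` helper, the `act_threshold` parameter, and all dimensions are illustrative assumptions, not the paper's actual architecture, pruning schedule, or Loihi 2 implementation.

```python
import numpy as np

def magnitude_prune(w, sparsity):
    """Zero out the smallest-magnitude entries of w (unstructured sparsity).
    Illustrative one-shot pruning; the paper's training-time method may differ."""
    k = int(sparsity * w.size)
    if k == 0:
        return w
    threshold = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return np.where(np.abs(w) > threshold, w, 0.0)

def sparse_linear_rnn(x_seq, A, B, C, act_threshold=0.0):
    """Diagonal linear RNN: h_t = A * h_{t-1} + B x_t, y_t = C h_t.
    Hidden activations below act_threshold are clamped to zero, so that a
    sparsity-aware backend can skip the corresponding multiply-accumulates."""
    h = np.zeros(A.shape[0])
    outputs = []
    for x in x_seq:
        h = A * h + B @ x
        h = np.where(np.abs(h) > act_threshold, h, 0.0)  # activation sparsity
        outputs.append(C @ h)
    return np.stack(outputs)

# Example: a wide hidden state with 90% unstructured weight sparsity (assumed values)
d_in, d_hidden, d_out, T = 16, 256, 8, 100
rng = np.random.default_rng(0)
A = rng.uniform(0.5, 0.99, d_hidden)                        # diagonal recurrent decay
B = magnitude_prune(0.1 * rng.normal(size=(d_hidden, d_in)), sparsity=0.9)
C = magnitude_prune(0.1 * rng.normal(size=(d_out, d_hidden)), sparsity=0.9)
x_seq = rng.normal(size=(T, d_in))
y = sparse_linear_rnn(x_seq, A, B, C, act_threshold=0.05)
print(y.shape)  # (100, 8)
```

On hardware that exploits unstructured sparsity (such as Loihi 2), zeros in `B`, `C`, and `h` translate into skipped operations and memory traffic, which is the source of the compute and energy savings the abstract reports; on dense accelerators the same zeros are typically still multiplied.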
Cite

Text:
Pierro et al. "Sparse and Wide Linear RNNs Are at the Efficiency-Performance Pareto Front." ICLR 2025 Workshops: SLLM, 2025.

Markdown:
[Pierro et al. "Sparse and Wide Linear RNNs Are at the Efficiency-Performance Pareto Front." ICLR 2025 Workshops: SLLM, 2025.](https://mlanthology.org/iclrw/2025/pierro2025iclrw-sparse/)

BibTeX:
@inproceedings{pierro2025iclrw-sparse,
  title     = {{Sparse and Wide Linear RNNs Are at the Efficiency-Performance Pareto Front}},
  author    = {Pierro, Alessandro and Abreu, Steven and Timcheck, Jonathan and Stratmann, Philipp and Shrestha, Sumit Bam},
  booktitle = {ICLR 2025 Workshops: SLLM},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/pierro2025iclrw-sparse/}
}