Learning Semi-Structured Sparsity for LLMs via Shared and Context-Aware Hypernetwork
Abstract
Large Language Models (LLMs) achieve state-of-the-art performance but are costly to deploy in resource-constrained environments. Pruning with $n:m$ semi-structured sparsity reduces computation and enables hardware acceleration, yet existing methods face a trade-off: one-shot approaches are efficient but heuristic, while optimization-based methods are accurate but expensive. We introduce HyperPrune, a resource-efficient framework that directly optimizes $n:m$ sparsity. A lightweight hypernetwork, shared across layers and conditioned on learnable embeddings, generates structured masks in a one-shot, layer-wise manner. Continual pruning preserves cross-layer knowledge, and feature outlier regularization retains critical activations, unifying the strengths of heuristic and optimization-based methods. Experiments on LLaMA-7B to 70B show state-of-the-art accuracy–sparsity trade-offs on a single A100 GPU, achieving higher efficiency, accuracy, and scalability than prior approaches. HyperPrune offers a practical, scalable, and hardware-friendly solution for structured LLM pruning.
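To make the $n:m$ constraint in the abstract concrete, the following minimal sketch builds a 2:4 mask for one row of weights. It assumes the common convention that $n$ of every $m$ consecutive weights are kept. The magnitude-based selection is only a stand-in heuristic for illustration; it is not HyperPrune's hypernetwork-generated mask, and the function name `nm_magnitude_mask` and the sample weights are hypothetical.

```python
def nm_magnitude_mask(row, n=2, m=4):
    """Return a 0/1 mask that keeps n entries in every group of m consecutive weights.

    Illustration only: entries are chosen by absolute magnitude, a simple
    heuristic that is NOT the learned mask described in the abstract.
    """
    if len(row) % m != 0:
        raise ValueError("row length must be a multiple of m")
    mask = []
    for start in range(0, len(row), m):
        group = row[start:start + m]
        # Indices of the n largest-magnitude weights inside this group.
        keep = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)[:n]
        mask.extend(1 if i in keep else 0 for i in range(m))
    return mask


# Hypothetical example weights: two groups of four.
weights = [0.9, -0.1, 0.05, -1.2, 0.3, 0.7, -0.6, 0.02]
mask = nm_magnitude_mask(weights, n=2, m=4)
pruned = [w * k for w, k in zip(weights, mask)]
```

Every group of four in `mask` contains exactly two ones, which is the regular pattern that sparse hardware kernels can exploit. A learned approach would replace the magnitude criterion with optimized mask scores while keeping the same per-group constraint.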
Cite
Text
Sun and Sakuma. "Learning Semi-Structured Sparsity for LLMs via Shared and Context-Aware Hypernetwork." International Conference on Learning Representations, 2026.
Markdown
[Sun and Sakuma. "Learning Semi-Structured Sparsity for LLMs via Shared and Context-Aware Hypernetwork." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/sun2026iclr-learning/)
BibTeX
@inproceedings{sun2026iclr-learning,
title = {{Learning Semi-Structured Sparsity for LLMs via Shared and Context-Aware Hypernetwork}},
author = {Sun, Lu and Sakuma, Jun},
booktitle = {International Conference on Learning Representations},
year = {2026},
url = {https://mlanthology.org/iclr/2026/sun2026iclr-learning/}
}