MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models
Abstract
Large Language Models (LLMs) are distinguished by their massive parameter counts, which typically result in significant redundancy. This work introduces MaskLLM, a learnable pruning method that establishes Semi-structured (or "N:M") Sparsity in LLMs, aimed at reducing computational overhead during inference. Instead of developing a new importance criterion, MaskLLM explicitly models N:M patterns as a learnable distribution through Gumbel Softmax sampling. This approach facilitates end-to-end training on large-scale datasets and offers two notable advantages: 1) High-quality Masks - our method effectively scales to large datasets and learns accurate masks; 2) Transferability - the probabilistic modeling of mask distribution enables the transfer learning of sparsity across domains or tasks. We assess MaskLLM using 2:4 sparsity on various LLMs, including LLaMA-2, Nemotron-4, and GPT-3, with sizes ranging from 843M to 15B parameters, and our empirical results show substantial improvements over state-of-the-art methods. For instance, leading approaches achieve a perplexity (PPL) of 10 or greater on Wikitext compared to the dense model's 5.12 PPL, but MaskLLM achieves a significantly lower 6.72 PPL solely by learning the masks with frozen weights. Furthermore, MaskLLM's learnable nature allows customized masks for lossless application of 2:4 sparsity to downstream tasks or domains. Code is available at https://github.com/NVlabs/MaskLLM.
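To make the abstract's core mechanism concrete, the sketch below shows how a 2:4 mask can be modeled as a learnable categorical distribution over the six candidate patterns and sampled differentiably with Gumbel-Softmax. This is a minimal PyTorch illustration, not the official MaskLLM implementation; the class and parameter names (`Masked24Linear`, `tau`) are assumptions for this example.

```python
import itertools
import torch
import torch.nn.functional as F


class Masked24Linear(torch.nn.Module):
    """Linear layer whose frozen weights are pruned by a learnable 2:4 mask."""

    def __init__(self, in_features, out_features, tau=4.0):
        super().__init__()
        assert in_features % 4 == 0, "2:4 sparsity needs groups of 4 weights"
        self.weight = torch.nn.Parameter(torch.randn(out_features, in_features))
        self.weight.requires_grad_(False)  # mirror the frozen-weight setting
        # All C(4,2) = 6 binary masks that keep exactly 2 of every 4 weights.
        candidates = torch.tensor(
            [[float(i in keep) for i in range(4)]
             for keep in itertools.combinations(range(4), 2)])  # shape (6, 4)
        self.register_buffer("candidates", candidates)
        # One categorical distribution (6 logits) per group of 4 weights.
        self.logits = torch.nn.Parameter(
            torch.zeros(out_features * in_features // 4, 6))
        self.tau = tau  # Gumbel-Softmax temperature (annealed in practice)

    def forward(self, x):
        # Differentiable (soft) sample of one candidate mask per group;
        # at inference one would instead pick the argmax candidate.
        probs = F.gumbel_softmax(self.logits, tau=self.tau, hard=False)  # (G, 6)
        mask = (probs @ self.candidates).view_as(self.weight)
        return F.linear(x, self.weight * mask)


layer = Masked24Linear(16, 8)
out = layer(torch.randn(2, 16))  # only layer.logits receives gradients
```

Because only the logits are trainable, gradients from a standard language-modeling loss flow into the mask distribution, which is what allows the masks to be learned end-to-end on large datasets and later transferred.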
Cite
Text
Fang et al. "MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models." Neural Information Processing Systems, 2024. doi:10.52202/079017-0248
Markdown
[Fang et al. "MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/fang2024neurips-maskllm/) doi:10.52202/079017-0248
BibTeX
@inproceedings{fang2024neurips-maskllm,
  title     = {{MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models}},
  author    = {Fang, Gongfan and Yin, Hongxu and Muralidharan, Saurav and Heinrich, Greg and Pool, Jeff and Kautz, Jan and Molchanov, Pavlo and Wang, Xinchao},
  booktitle = {Neural Information Processing Systems},
  year      = {2024},
  doi       = {10.52202/079017-0248},
  url       = {https://mlanthology.org/neurips/2024/fang2024neurips-maskllm/}
}