Compressed Sparse Tiles for Memory-Efficient Unstructured and Semi-Structured Sparsity

Abstract

Storing the weights of Large Language Models (LLMs) in GPU memory for local inference is challenging due to their size. While quantization has proven successful at reducing the memory footprint of LLMs, unstructured pruning introduces overhead by requiring the locations of the non-pruned weights to be encoded. This overhead hinders the efficient combination of quantization and unstructured pruning, especially for the smaller batch sizes common in inference scenarios. To address this, we propose the CS256 storage format, which offers a better balance between space efficiency and hardware acceleration than existing formats. CS256 partitions the weight matrix into tiles and uses a hierarchical indexing scheme to locate non-zero values, reducing the overhead associated with storing sparsity patterns. Our preliminary results with one-shot pruning of LLMs show that CS256 matches the performance of unstructured sparsity while being more hardware-friendly. Our code is available at: https://github.com/mklasby/llm-compressor/tree/mklasby-cs256
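
To make the tiling idea concrete, below is a minimal Python/NumPy sketch of a tile-based compressed sparse format. It assumes a tile width of 256 columns (suggested by the name CS256); the function names compress_tiles and decompress_tiles and the flat one-byte intra-tile offset layout are illustrative assumptions, not the paper's actual hierarchical indexing scheme.

import numpy as np

def compress_tiles(weight, tile_size=256):
    # Toy tile-wise compression (illustrative, not the CS256 layout):
    # for each row-wise tile, keep only the non-zero values plus their
    # intra-tile offsets; uint8 suffices for offsets 0..255.
    rows, cols = weight.shape
    assert cols % tile_size == 0, "sketch assumes cols divisible by tile_size"
    tiles = []
    for r in range(rows):
        for c0 in range(0, cols, tile_size):
            tile = weight[r, c0:c0 + tile_size]
            idx = np.nonzero(tile)[0].astype(np.uint8)  # 1-byte offsets within the tile
            tiles.append((idx, tile[idx]))
    return tiles

def decompress_tiles(tiles, shape, tile_size=256):
    # Reconstruct the dense matrix from the per-tile (offsets, values) pairs.
    rows, cols = shape
    out = np.zeros(shape, dtype=np.float32)
    t = 0
    for r in range(rows):
        for c0 in range(0, cols, tile_size):
            idx, vals = tiles[t]
            out[r, c0 + idx.astype(np.int64)] = vals
            t += 1
    return out

# Round-trip check on a random ~50%-sparse matrix.
w = np.random.randn(4, 512).astype(np.float32)
w[np.random.rand(*w.shape) < 0.5] = 0.0
assert np.allclose(decompress_tiles(compress_tiles(w), w.shape), w)

Storing one-byte intra-tile offsets instead of full column indices illustrates the kind of index overhead reduction that tiling enables; per the abstract, CS256 additionally locates non-zero values through a hierarchical index over tiles to remain hardware-friendly.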

Cite

Text

Lasby et al. "Compressed Sparse Tiles for Memory-Efficient Unstructured and Semi-Structured Sparsity." ICLR 2025 Workshops: SLLM, 2025.

Markdown

[Lasby et al. "Compressed Sparse Tiles for Memory-Efficient Unstructured and Semi-Structured Sparsity." ICLR 2025 Workshops: SLLM, 2025.](https://mlanthology.org/iclrw/2025/lasby2025iclrw-compressed/)

BibTeX

@inproceedings{lasby2025iclrw-compressed,
  title     = {{Compressed Sparse Tiles for Memory-Efficient Unstructured and Semi-Structured Sparsity}},
  author    = {Lasby, Mike and Zimmer, Max and Pokutta, Sebastian and Schultheis, Erik},
  booktitle = {ICLR 2025 Workshops: SLLM},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/lasby2025iclrw-compressed/}
}