S2-Attention: Hardware-Aware Context Sharding Among Attention Heads

Abstract

Sparse attention, which selectively attends to a subset of tokens in the context, has long been used to improve the efficiency of Transformers. However, its theoretical reduction in FLOPs rarely translates into wall-clock speedups over dense attention, mainly due to the lack of hardware-level optimizations like FlashAttention (Dao, 2023). It also remains unclear whether sparse attention can maintain model quality at the scale of today's large language models (LLMs), and how this can be achieved. This paper presents Sparsely-Sharded Attention (S2-ATTENTION), an optimized Triton kernel library providing a variety of customizable sparse attention implementations for both training and inference. S2-ATTENTION allows attention patterns to be customized per head and per context range. The insights it yields inspire a novel sparse attention architecture, the Head-Heterogeneous Strided Transformer (HHST), which meets several desiderata we find crucial for achieving both practical efficiency gains and strong accuracy on downstream tasks. To achieve higher sparsity, HHST shards the context heterogeneously across attention heads: each head attends to a different subset of tokens, while the heads collectively cover the whole context. We evaluate HHST by pretraining 1.3B- and 7B-parameter models. For attention computation, HHST with S2-ATTENTION achieves 8.8× and 15.9× wall-clock attention speedups, and reduces training time by 2.8× and 2.5×, compared to a dense attention baseline implemented with FlashAttention-2. Moreover, HHST's downstream task performance is on par with dense attention, and the 7B model achieves perfect retrieval accuracy at a 128K context length. At inference, our 7B HHST achieves a 4.5× speedup over its dense counterpart in vLLM. S2-ATTENTION is released with easy-to-customize APIs for direct use in Megatron and vLLM.
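
To make the head-heterogeneous sharding idea concrete, below is a minimal PyTorch sketch of a boolean attention mask in which every head keeps a shared local window and additionally attends to its own strided shard of the remaining context, so that the heads jointly cover all tokens. This is an illustration of the concept described in the abstract, not the S2-ATTENTION API; the function name, the `local_window` default, and the modulo-based shard assignment are all assumptions made for the example.

```python
import torch

def head_sharded_mask(seq_len: int, num_heads: int, local_window: int = 128) -> torch.Tensor:
    """Hypothetical illustration of head-heterogeneous context sharding.

    Each head attends to (i) a shared causal local window and (ii) a distinct
    strided shard of the earlier context; the union of all heads' shards
    covers every token. Not the actual S2-ATTENTION kernel interface.
    """
    q_idx = torch.arange(seq_len).view(1, seq_len, 1)   # query positions
    k_idx = torch.arange(seq_len).view(1, 1, seq_len)   # key positions
    causal = k_idx <= q_idx                              # standard causal mask
    local = (q_idx - k_idx) < local_window               # shared local window
    head = torch.arange(num_heads).view(num_heads, 1, 1)
    # Head h additionally attends to keys whose index is congruent to h
    # modulo num_heads, so the heads collectively cover the full context.
    strided = (k_idx % num_heads) == head
    return causal & (local | strided)                    # (num_heads, seq_len, seq_len)

# Example: 8 heads over a 1,024-token context.
mask = head_sharded_mask(1024, 8)
```

In an actual kernel, such a pattern would be realized block-sparsely inside the Triton implementation rather than as a dense boolean mask; the mask here only serves to show which (query, key) pairs each head covers.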

Cite

Text

Lin et al. "S2-Attention: Hardware-Aware Context Sharding Among Attention Heads." ICLR 2025 Workshops: SLLM, 2025.

Markdown

[Lin et al. "S2-Attention: Hardware-Aware Context Sharding Among Attention Heads." ICLR 2025 Workshops: SLLM, 2025.](https://mlanthology.org/iclrw/2025/lin2025iclrw-s2attention/)

BibTeX

@inproceedings{lin2025iclrw-s2attention,
  title     = {{S2-Attention: Hardware-Aware Context Sharding Among Attention Heads}},
  author    = {Lin, Xihui and Zhang, Yunan and Ge, Suyu and Ren, Liliang and Patra, Barun and Chaudhary, Vishrav and Peng, Hao and Song, Xia},
  booktitle = {ICLR 2025 Workshops: SLLM},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/lin2025iclrw-s2attention/}
}