Star Attention: Efficient LLM Inference over Long Sequences

Shantanu Acharya, Fei Jia, Boris Ginsburg

ICML 2025 pp. 356-371

Abstract

Inference with Transformer-based Large Language Models (LLMs) on long sequences is both costly and slow due to the quadratic complexity of the self-attention mechanism. We introduce Star Attention, a two-phase block-sparse approximation that improves computational efficiency by sharding attention across multiple hosts while minimizing communication overhead. In the first phase, the context is processed using blockwise-local attention across hosts, in parallel. In the second phase, query and response tokens attend to all prior cached tokens through sequence-global attention. Star Attention integrates seamlessly with most Transformer-based LLMs trained with global attention, reducing memory requirements and inference time by up to 11x while preserving 97-100% of accuracy.
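
The two phases described in the abstract can be illustrated with a minimal single-head, single-layer NumPy sketch. This is not the authors' implementation: the function names (`attention`, `phase1_local`, `phase2_global`), the `block_size` parameter, and the use of a loop over blocks to stand in for separate hosts are all assumptions made for illustration. The sketch also assumes that per-block results in the second phase are merged using each block's log-sum-exp, which is one standard way to recover exact global softmax attention from sharded keys and values, and it omits the query tokens' own keys and values for brevity.

```python
import numpy as np

def attention(q, k, v, causal=False):
    """Single-head scaled dot-product attention.

    Returns the output and the per-query log-sum-exp of the scores,
    which is needed later to merge partial results exactly.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if causal:
        # Mask future positions (assumes q and k cover the same positions).
        mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)
    m = scores.max(axis=-1, keepdims=True)
    w = np.exp(scores - m)
    denom = w.sum(axis=-1, keepdims=True)
    out = (w / denom) @ v
    lse = (m + np.log(denom)).squeeze(-1)
    return out, lse

def phase1_local(q_ctx, k_ctx, v_ctx, block_size):
    """Phase 1 (sketch): blockwise-local attention over the context.

    Each block stands in for one host (hypothetical setup); blocks never
    attend to each other, so they could run in parallel.
    """
    outputs, kv_cache = [], []
    for start in range(0, len(k_ctx), block_size):
        sl = slice(start, start + block_size)
        out, _ = attention(q_ctx[sl], k_ctx[sl], v_ctx[sl], causal=True)
        outputs.append(out)
        kv_cache.append((k_ctx[sl], v_ctx[sl]))  # cached per "host"
    return np.concatenate(outputs), kv_cache

def phase2_global(q, kv_cache):
    """Phase 2 (sketch): query tokens attend to all cached tokens.

    Each block returns a partial output plus its log-sum-exp; weighting
    the partial outputs by softmax over the log-sum-exps reproduces
    attention over the concatenated cache.
    """
    outs, lses = zip(*(attention(q, k, v) for k, v in kv_cache))
    outs = np.stack(outs)                      # (blocks, n_q, d)
    lses = np.stack(lses)                      # (blocks, n_q)
    weights = np.exp(lses - lses.max(axis=0))  # stable softmax over blocks
    weights /= weights.sum(axis=0)
    return (weights[..., None] * outs).sum(axis=0)
```

Under these assumptions, only one output vector and one scalar per query token need to leave each block in the second phase, rather than the block's full key-value cache, which is the kind of saving the phrase "minimizing communication overhead" points to.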

Cite

Text

Acharya et al. "Star Attention: Efficient LLM Inference over Long Sequences." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown

[Acharya et al. "Star Attention: Efficient LLM Inference over Long Sequences." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/acharya2025icml-star/)

BibTeX

@inproceedings{acharya2025icml-star,
  title     = {{Star Attention: Efficient LLM Inference over Long Sequences}},
  author    = {Acharya, Shantanu and Jia, Fei and Ginsburg, Boris},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
  pages     = {356-371},
  volume    = {267},
  url       = {https://mlanthology.org/icml/2025/acharya2025icml-star/}
}