Long-Context Attention Benchmark: From Kernel Efficiency to Distributed Context Parallelism

Abstract

Transformer-based large language models (LLMs) have achieved remarkable success, yet their standard softmax-based attention mechanism incurs computation and memory costs that grow quadratically with sequence length, posing a major bottleneck for long-context training. Prior work tackles this challenge along two directions: (1) kernel-level optimizations, which accelerate dense and sparse attention operators; and (2) module-level strategies, often referred to as distributed attention or context parallel training, which scale attention across multiple devices. However, systematic evaluation remains limited: operator-level comparisons are often incomplete, while context parallel strategies are typically framework-specific, and their performance across contexts remains unclear. To address these gaps, we propose a unified benchmark that integrates representative attention kernels and context parallel mechanisms behind a modular and extensible evaluation interface. The benchmark evaluates methods along two critical dimensions: (1) attention mask patterns, which strongly affect efficiency, scalability, and usability; and (2) sequence length and distributed scale, which determine performance under extreme long-context training. Through comprehensive experiments on a cluster of up to 96 GPUs, our benchmark enables reproducible comparisons, highlights method-specific trade-offs, and provides practical guidance for designing and deploying attention mechanisms in long-context LLM training.
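
As a rough illustration of the quadratic cost and of why the mask pattern matters, the following sketch computes the memory a naive implementation would need to hold the full attention score matrix, and the fraction of that matrix a causal mask keeps. The function names, the single-head default, and the 2-byte element size are assumptions chosen for illustration; they are not taken from the benchmark.

```python
def score_matrix_bytes(seq_len: int, num_heads: int = 1, bytes_per_elem: int = 2) -> int:
    """Bytes needed to store a full seq_len x seq_len score matrix per head.

    Assumes a naive implementation that materializes every score;
    bytes_per_elem=2 corresponds to a 16-bit float (an assumption).
    """
    return num_heads * seq_len * seq_len * bytes_per_elem


def causal_mask_density(seq_len: int) -> float:
    """Fraction of score entries kept by a causal (lower-triangular) mask.

    A causal mask keeps n(n+1)/2 of the n*n entries, which approaches 1/2
    as the sequence grows.
    """
    return (seq_len + 1) / (2 * seq_len)


if __name__ == "__main__":
    for n in (1024, 2048, 4096):
        # Doubling the sequence length quadruples the score-matrix size.
        print(n, score_matrix_bytes(n), causal_mask_density(n))
```

Doubling the sequence length quadruples the score-matrix size, while a causal mask leaves only about half of the entries to compute, which is one reason mask patterns can change the relative efficiency of different methods.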

Cite

Text

Bu et al. "Long-Context Attention Benchmark: From Kernel Efficiency to Distributed Context Parallelism." International Conference on Learning Representations, 2026.

Markdown

[Bu et al. "Long-Context Attention Benchmark: From Kernel Efficiency to Distributed Context Parallelism." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/bu2026iclr-longcontext/)

BibTeX

@inproceedings{bu2026iclr-longcontext,
  title     = {{Long-Context Attention Benchmark: From Kernel Efficiency to Distributed Context Parallelism}},
  author    = {Bu, Tao and Wang, Qiangang and Zeng, Bowen and Sun, Hanwen and Huang, Yunpeng and Cao, Chun and Xu, Jingwei},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/bu2026iclr-longcontext/}
}