Long-Context Generalization with Sparse Attention

Vasylenko, Pavlo; Pitorro, Hugo; Martins, Andre; Treviso, Marcos Vinicius

ICLR 2026

Abstract

Transformer-based architectures traditionally employ softmax to compute attention weights, which produces dense distributions over all tokens in a sequence. While effective in many settings, this density has been shown to be detrimental for tasks that demand precise focus on fixed-size patterns: as sequence length increases, non-informative tokens accumulate attention probability mass, leading to dispersion and representational collapse. We show in this paper that dynamically sparse attention mechanisms using $\alpha$-entmax can avoid these issues, due to their ability to assign exact zeros to irrelevant tokens. Furthermore, we introduce Adaptive-Scalable Entmax (ASEntmax), which endows $\alpha$-entmax with a learnable temperature parameter, allowing the attention distribution to interpolate between sparse (pattern-focused) and dense (softmax-like) regimes. Our empirical evaluation on synthetic tasks and language modeling demonstrates that ASEntmax substantially outperforms softmax, scalable softmax, and fixed-temperature $\alpha$-entmax baselines, achieving up to 1000$\times$ length extrapolation on synthetic benchmarks and superior long-context generalization on language modeling while preserving short-context performance, including better perplexity trends and higher retrieval accuracies at 8$\times$ training length.
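The dispersion effect and the role of exact zeros described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation and it does not reproduce ASEntmax: `entmax_bisect`, `needle_scores`, the score gap of 3.0, and the fixed scalar temperatures are all hypothetical choices made for illustration. The sketch assumes the standard definition of $\alpha$-entmax for $\alpha > 1$, $p_i = [(\alpha - 1) z_i - \tau]_+^{1/(\alpha - 1)}$, with $\tau$ chosen so that the weights sum to one, and finds $\tau$ by bisection.

```python
import numpy as np


def softmax(z):
    """Dense baseline: every token receives strictly positive weight."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()


def entmax_bisect(z, alpha=1.5, n_iter=60):
    """alpha-entmax of a 1-D score vector (alpha > 1), via bisection on tau.

    Hypothetical helper for illustration, not the paper's code.
    """
    if alpha <= 1.0:
        raise ValueError("this sketch only covers alpha > 1")
    zs = (alpha - 1.0) * np.asarray(z, dtype=float)
    expo = 1.0 / (alpha - 1.0)
    # At tau = max - 1 the total mass is >= 1; at tau = max it is 0.
    lo, hi = zs.max() - 1.0, zs.max()
    for _ in range(n_iter):
        tau = 0.5 * (lo + hi)
        mass = (np.clip(zs - tau, 0.0, None) ** expo).sum()
        if mass >= 1.0:
            lo = tau
        else:
            hi = tau
    # Scores at or below the threshold are clipped to exactly zero.
    p = np.clip(zs - lo, 0.0, None) ** expo
    return p / p.sum()


def needle_scores(n, gap=3.0):
    """One relevant token (index 0) among n - 1 equally scored distractors."""
    z = np.zeros(n)
    z[0] = gap
    return z


# Softmax weight on the relevant token shrinks as the sequence grows.
short_n, long_n = 16, 16384
soft_short = softmax(needle_scores(short_n))[0]
soft_long = softmax(needle_scores(long_n))[0]

# With this gap, entmax zeroes out every distractor regardless of length.
ent_long = entmax_bisect(needle_scores(long_n))

# A scalar temperature T (assumed fixed here, not learned) moves the output
# between a sparse regime (small T) and a dense regime (large T).
z = np.linspace(0.0, 3.0, 8)
sparse_support = np.count_nonzero(entmax_bisect(z / 0.25))
dense_support = np.count_nonzero(entmax_bisect(z / 4.0))

print(f"softmax weight on relevant token: n={short_n}: {soft_short:.3f}, "
      f"n={long_n}: {soft_long:.5f}")
print(f"entmax nonzero weights at n={long_n}: {np.count_nonzero(ent_long)}")
print(f"entmax support size: T=0.25 -> {sparse_support}, T=4 -> {dense_support}")
```

In this toy setting the softmax weight on the relevant token decays as distractors are added, while the entmax output keeps all of its mass on that token because the distractors fall below the threshold. The temperature sweep only shows the qualitative sparse-to-dense interpolation; how ASEntmax parameterizes and learns its temperature is specified in the paper, not here.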

Cite

Text

Vasylenko et al. "Long-Context Generalization with Sparse Attention." International Conference on Learning Representations, 2026.

Markdown

[Vasylenko et al. "Long-Context Generalization with Sparse Attention." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/vasylenko2026iclr-longcontext/)

BibTeX

@inproceedings{vasylenko2026iclr-longcontext,
  title     = {{Long-Context Generalization with Sparse Attention}},
  author    = {Vasylenko, Pavlo and Pitorro, Hugo and Martins, Andre and Treviso, Marcos Vinicius},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/vasylenko2026iclr-longcontext/}
}