Long-Context Generalization with Sparse Attention
Abstract
Transformer-based architectures traditionally employ softmax to compute attention weights, which produces dense distributions over all tokens in a sequence. While effective in many settings, this density has been shown to be detrimental for tasks that demand precise focus on fixed-size patterns: as sequence length increases, non-informative tokens accumulate attention probability mass, leading to dispersion and representational collapse. We show in this paper that dynamically sparse attention mechanisms using $\alpha$-entmax can avoid these issues, due to their ability to assign exact zeros to irrelevant tokens. Furthermore, we introduce Adaptive-Scalable Entmax (ASEntmax), which endows $\alpha$-entmax with a learnable temperature parameter, allowing the attention distribution to interpolate between sparse (pattern-focused) and dense (softmax-like) regimes. Our empirical evaluation on synthetic tasks and language modeling demonstrates that ASEntmax substantially outperforms softmax, scalable softmax, and fixed-temperature $\alpha$-entmax baselines, achieving up to 1000$\times$ length extrapolation on synthetic benchmarks and superior long-context generalization on language modeling while preserving short-context performance, including better perplexity trends and higher retrieval accuracies at 8$\times$ training length.
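The dispersion effect and the exact-zero property described above can be illustrated with a minimal sketch. It uses sparsemax, the $\alpha = 2$ special case of $\alpha$-entmax, on a toy score vector with one informative token and many distractors. The function names, the toy scores, and the fixed `temperature` argument are assumptions made for illustration. In particular, the temperature here is a plain constant and is not the learnable parametrization used by ASEntmax.

```python
import math


def softmax(scores):
    """Dense attention weights: every token receives strictly positive mass."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sparsemax(scores, temperature=1.0):
    """Sparsemax (alpha-entmax with alpha = 2) via the sort-based threshold.

    `temperature` is a hypothetical fixed knob for illustration only:
    larger values flatten the scores and make the output denser.
    """
    z = [s / temperature for s in scores]
    z_sorted = sorted(z, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, z_k in enumerate(z_sorted, start=1):
        cumsum += z_k
        # Token k stays in the support while 1 + k * z_(k) > sum of top-k scores.
        if 1.0 + k * z_k > cumsum:
            tau = (cumsum - 1.0) / k
        else:
            break
    # Scores at or below the threshold tau are mapped to exactly zero.
    return [max(v - tau, 0.0) for v in z]


def toy_scores(n, signal=5.0):
    """One informative token (score `signal`) followed by n - 1 distractors (score 0)."""
    return [signal] + [0.0] * (n - 1)


for n in (16, 1024, 65536):
    s = toy_scores(n)
    print(n, softmax(s)[0], sparsemax(s)[0])
```

In this toy setting the softmax weight on the informative token shrinks as the number of distractors grows, while sparsemax keeps all of its mass on that token and assigns exact zeros to the rest. Raising the hypothetical `temperature` (for example, `sparsemax(toy_scores(4), temperature=10.0)`) yields a dense, softmax-like distribution, which is the kind of sparse-to-dense interpolation the abstract attributes to a temperature parameter.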
Cite
Text
Vasylenko et al. "Long-Context Generalization with Sparse Attention." International Conference on Learning Representations, 2026.
Markdown
[Vasylenko et al. "Long-Context Generalization with Sparse Attention." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/vasylenko2026iclr-longcontext/)
BibTeX
@inproceedings{vasylenko2026iclr-longcontext,
title = {{Long-Context Generalization with Sparse Attention}},
author = {Vasylenko, Pavlo and Pitorro, Hugo and Martins, Andre and Treviso, Marcos Vinicius},
booktitle = {International Conference on Learning Representations},
year = {2026},
url = {https://mlanthology.org/iclr/2026/vasylenko2026iclr-longcontext/}
}