RACE Attention: A Strictly Linear-Time Attention Layer for Training on Outrageously Large Contexts

Abstract

Softmax Attention has a quadratic time complexity in sequence length, which becomes prohibitive to run at long contexts, even with highly optimized GPU kernels. For example, FlashAttention-2/3 (exact, GPU-optimized implementations of Softmax Attention) cannot complete a single forward–backward pass of a single attention layer once the context exceeds $\sim 4$ million tokens on an NVIDIA GH200 (96 GB). We introduce **R**epeated **A**rrays-of-**C**ount **E**stimators (RACE) Attention, a kernel-inspired alternative to Softmax Attention that is strictly linear in sequence length and embedding size. RACE Attention replaces the exponential kernel with a sharpened angular similarity, and approximates attention outputs via Gaussian random projections and \emph{soft} Locality-Sensitive Hashing (LSH), avoiding construction of the full attention matrix. Across language modeling, masked language modeling, and text/image classification, RACE Attention matches or outperforms strong baselines up to $64$K seqeuence length while reducing wall-clock time and memory usage. In addition, we conduct a controlled scaling study on a single attention layer and demonstrate processing of up to 12 million tokens on an NVIDIA GH200 GPU and 75 million tokens on an Intel Xeon® Gold 5220R CPU in a single forward–backward pass, which is well beyond the capabilities of current state-of-the-art attention implementations. RACE Attention thus offers a practical and theoretically grounded mechanism for long-context training on today’s hardware.

Cite

Text

Joshi et al. "RACE Attention: A Strictly Linear-Time Attention Layer for Training on Outrageously Large Contexts." International Conference on Learning Representations, 2026.

Markdown

[Joshi et al. "RACE Attention: A Strictly Linear-Time Attention Layer for Training on Outrageously Large Contexts." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/joshi2026iclr-race/)

BibTeX

@inproceedings{joshi2026iclr-race,
  title     = {{RACE Attention: A Strictly Linear-Time Attention Layer for Training on Outrageously Large Contexts}},
  author    = {Joshi, Sahil and Chowdhury, Agniva and Kanakamedala, Amar and Singh, Ekam and Tu, Evan and Shrivastava, Anshumali},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/joshi2026iclr-race/}
}