FlexLinearAttention: Compiling a Unified Abstraction into Scalable Kernels for Linear Attention

Abstract

The quadratic complexity of softmax attention is a major bottleneck for long-context modeling, which has motivated a surge of attention variants with linear complexity. Unlike softmax attention, which benefits from optimized kernels, linear attention lacks general-purpose, hardware-efficient support and scalable distributed implementations. We introduce **Flex**ible **L**inear **A**ttention (FlexLA), a domain-specific compiler that automates the generation of high-performance, scalable kernels for a wide range of linear attention models directly from high-level PyTorch code. At its core, FlexLA employs an intuitive programming abstraction that decomposes any linear attention algorithm into three canonical phases: intra-chunk computation, inter-chunk state propagation, and output merging. This unified abstraction enables FlexLA to perform domain-specific optimizations, automatically generating kernels that fuse computation and communication at a fine-grained tile level and eliminate host synchronization. Our evaluation demonstrates that FlexLA combines programmability with performance: a wide range of linear attention variants can be implemented in just a few dozen lines of code, while the generated kernels deliver 1.01x-4.9x the performance of the state-of-the-art expert-optimized library. On scalar gated linear attention, they scale with near-linear efficiency to 16 million tokens on 128 GPUs, surpassing the state-of-the-art distributed baseline by up to 7.2x.
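To make the three-phase decomposition concrete, here is a minimal NumPy sketch of chunked causal linear attention. It is an illustration under stated assumptions, not FlexLA's API or its generated kernels: the function names, the `chunk_size` parameter, and the choice of plain un-normalized, ungated linear attention are all hypothetical and were picked only to show how the intra-chunk, inter-chunk, and merge phases fit together.

```python
import numpy as np


def linear_attention_reference(q, k, v):
    """Quadratic-time causal linear attention: o_t = sum_{s<=t} (q_t . k_s) v_s."""
    scores = np.tril(q @ k.T)  # lower triangle keeps only positions s <= t
    return scores @ v


def linear_attention_chunked(q, k, v, chunk_size=4):
    """Same result, computed chunk by chunk with a running (d_k, d_v) state.

    Hypothetical illustration of the three phases named in the abstract;
    it does not reproduce FlexLA's actual programming interface.
    """
    seq_len, d_k = q.shape
    d_v = v.shape[1]
    state = np.zeros((d_k, d_v))  # sum of k_s^T v_s over all previous chunks
    out = np.empty((seq_len, d_v))

    for start in range(0, seq_len, chunk_size):
        end = min(start + chunk_size, seq_len)
        qc, kc, vc = q[start:end], k[start:end], v[start:end]

        # Phase 1: intra-chunk computation (causal attention inside the chunk).
        intra = np.tril(qc @ kc.T) @ vc

        # Phase 3: output merging (local result plus contribution of past chunks,
        # read from the state carried in from the previous iteration).
        out[start:end] = intra + qc @ state

        # Phase 2: inter-chunk state propagation (hand the updated state
        # to the next chunk).
        state = state + kc.T @ vc

    return out
```

In this sketch the state update is a plain sum; gated variants such as the scalar gated linear attention mentioned above would additionally decay the state between chunks, which is omitted here. The per-chunk cost depends only on `chunk_size` and the head dimensions, so total work grows linearly with sequence length.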

Cite

Text

Duanmu et al. "FlexLinearAttention: Compiling a Unified Abstraction into Scalable Kernels for Linear Attention." International Conference on Learning Representations, 2026.

Markdown

[Duanmu et al. "FlexLinearAttention: Compiling a Unified Abstraction into Scalable Kernels for Linear Attention." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/duanmu2026iclr-flexlinearattention/)

BibTeX

@inproceedings{duanmu2026iclr-flexlinearattention,
  title     = {{FlexLinearAttention: Compiling a Unified Abstraction into Scalable Kernels for Linear Attention}},
  author    = {Duanmu, Haojie and Zheng, Size and Zheng, Ningxin and Lu, Jianqiao and Zheng, Xuegui and Zhang, Xingcheng and Chang, Li-Wen and Liu, Xin and Lin, Dahua},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/duanmu2026iclr-flexlinearattention/}
}