MonarchAttention: Zero-Shot Conversion to Fast, Hardware-Aware Structured Attention
Abstract
Transformers have achieved state-of-the-art performance across a wide range of tasks, but suffer from quadratic complexity in sequence length due to the attention mechanism. In this work, we propose MonarchAttention -- a novel approach to sub-quadratic attention approximation via Monarch matrices, an expressive class of structured matrices. Based on the variational form of softmax, we describe an efficient optimization-based algorithm to compute an approximate projection of softmax attention onto the class of Monarch matrices with $\Theta(N\sqrt{N} d)$ computational complexity and $\Theta(Nd)$ memory/IO complexity. Unlike previous approaches, MonarchAttention is both (1) transferable, yielding minimal performance loss with no additional training, even when replacing every attention layer of the transformer, and (2) hardware-efficient, utilizing the highest-throughput tensor core units on modern GPUs. With optimized kernels, MonarchAttention achieves substantial speed-ups in wall-time over FlashAttention-2: $1.4\times$ for shorter sequences $(N=256)$, $4.5\times$ for medium-length sequences $(N=4K)$, and $8.2\times$ for longer sequences $(N=16K)$. We demonstrate the quality of MonarchAttention on diverse tasks and architectures in vision and language, showing that it flexibly and accurately approximates softmax attention in a variety of contexts.
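The abstract describes projecting softmax attention onto the class of Monarch matrices; the projection algorithm itself is not given here, but the minimal NumPy sketch below illustrates the Monarch structure and why applying it to a length-$N$ vector costs $\Theta(N\sqrt{N})$ operations (so $\Theta(N\sqrt{N} d)$ across $d$ head dimensions): a product of two block-diagonal factors interleaved with a fixed permutation. All names (`monarch_apply`, `L`, `R`) and the square $N = m^2$ layout are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def monarch_apply(x, L, R):
    """Apply a Monarch-structured matrix M = P L_bd P R_bd to a vector x.

    x    : (N,) input, with N = m * m (square case assumed for illustration)
    L, R : (m, m, m) arrays holding the m dense (m x m) blocks of the two
           block-diagonal factors L_bd and R_bd
    P    : the (m, m) "transpose" (stride) permutation, its own inverse in
           this square case, realized below by reshape + transpose

    Cost: two batches of m matvecs of size (m x m) = Theta(N * sqrt(N)).
    """
    m = L.shape[0]
    assert x.shape == (m * m,)
    # Right block-diagonal factor: block b multiplies x[b*m:(b+1)*m]
    y = np.einsum("bij,bj->bi", R, x.reshape(m, m))
    # Permutation P: transpose the (m, m) grid
    y = y.T
    # Left block-diagonal factor
    z = np.einsum("bij,bj->bi", L, y)
    # Permutation P again, then flatten back to length N
    return z.T.reshape(m * m)

def dense_block_diag(blocks):
    """Assemble the explicit block-diagonal matrix (for checking only)."""
    m, s, _ = blocks.shape
    out = np.zeros((m * s, m * s))
    for b in range(m):
        out[b * s:(b + 1) * s, b * s:(b + 1) * s] = blocks[b]
    return out

# Sanity check against an explicit dense construction for a small example
m = 4
rng = np.random.default_rng(0)
L = rng.standard_normal((m, m, m))
R = rng.standard_normal((m, m, m))
x = rng.standard_normal(m * m)

perm = np.arange(m * m).reshape(m, m).T.reshape(-1)
P = np.eye(m * m)[perm]
M = P @ dense_block_diag(L) @ P @ dense_block_diag(R)
assert np.allclose(M @ x, monarch_apply(x, L, R))
```

The structured apply never materializes the dense $N \times N$ matrix and reduces to two batched small matrix multiplies plus transposes, which is why such factors map well onto GPU tensor cores; how the paper fits the block factors to approximate a given softmax attention matrix is not shown here.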
Cite
Text
Yaras et al. "MonarchAttention: Zero-Shot Conversion to Fast, Hardware-Aware Structured Attention." Advances in Neural Information Processing Systems, 2025.Markdown
[Yaras et al. "MonarchAttention: Zero-Shot Conversion to Fast, Hardware-Aware Structured Attention." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/yaras2025neurips-monarchattention/)BibTeX
@inproceedings{yaras2025neurips-monarchattention,
  title = {{MonarchAttention: Zero-Shot Conversion to Fast, Hardware-Aware Structured Attention}},
  author = {Yaras, Can and Xu, Alec S and Abillama, Pierre and Lee, Changwoo and Balzano, Laura},
  booktitle = {Advances in Neural Information Processing Systems},
  year = {2025},
  url = {https://mlanthology.org/neurips/2025/yaras2025neurips-monarchattention/}
}