SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse–Linear Attention

Zhang, Jintao; Wang, Haoxu; Jiang, Kai; Yang, Shuo; Zheng, Kaiwen; Xi, Haocheng; Wang, Ziteng; Zhu, Hongzhou; Zhao, Min; Stoica, Ion; Gonzalez, Joseph E.; Chen, Jianfei; Zhu, Jun

ICLR 2026

Abstract

In Diffusion Transformer (DiT) models, particularly for video generation, attention latency is a major bottleneck because sequences are long and attention has quadratic complexity. We find that attention weights can be decoupled into two matrices: a small fraction of large weights with high rank, and the remaining weights with very low rank. This naturally suggests applying sparse acceleration to the first part and low-rank acceleration to the second. Based on this finding, we propose SLA (**S**parse-**L**inear **A**ttention), a trainable attention method that fuses sparse and linear attention to accelerate diffusion models. SLA classifies attention weights as critical, marginal, or negligible, applying $\mathcal{O}(N^2)$ attention to critical weights, $\mathcal{O}(N)$ attention to marginal weights, and skipping negligible ones. SLA combines these computations into a single GPU kernel and supports both forward and backward passes. With only a few fine-tuning steps using SLA, DiT models reduce attention computation by **95%** (a **20x** reduction) without degrading end-to-end generation quality, outperforming baseline methods. In addition, our efficient GPU kernel for SLA yields a **13.7x** speedup in attention computation and a **2.2x** end-to-end speedup in video generation on Wan2.1-1.3B. The code is available at https://github.com/thu-ml/SLA.
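The three-way classification described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' kernel or API: the function names (`classify_blocks`, `sparse_linear_attention`), the block-level scoring with mean-pooled queries and keys, the `top_frac` / `bottom_frac` parameters, and the `elu(x) + 1` feature map are all assumptions chosen for illustration, and the two outputs are simply summed.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def phi(x):
    # Assumed positive feature map for linear attention: elu(x) + 1.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def classify_blocks(q, k, block, top_frac, bottom_frac):
    """Label each (query block, key block) pair:
    1 = critical, 0 = marginal, -1 = negligible (hypothetical scheme)."""
    n, d = q.shape
    nb = n // block
    # Cheap block-level attention estimate from mean-pooled queries and keys.
    q_pool = q.reshape(nb, block, d).mean(axis=1)
    k_pool = k.reshape(nb, block, d).mean(axis=1)
    p = softmax(q_pool @ k_pool.T / np.sqrt(d))
    order = np.argsort(-p, axis=1)  # key blocks by descending weight, per row
    n_top = max(1, int(round(top_frac * nb)))
    n_bot = int(round(bottom_frac * nb))
    assert n_top + n_bot <= nb, "critical and negligible sets must not overlap"
    labels = np.zeros((nb, nb), dtype=np.int8)
    rows = np.arange(nb)[:, None]
    labels[rows, order[:, :n_top]] = 1
    if n_bot > 0:
        labels[rows, order[:, nb - n_bot:]] = -1
    return labels

def sparse_linear_attention(q, k, v, labels, block):
    n, d = q.shape
    nb = n // block
    qb = q.reshape(nb, block, d)
    kb = k.reshape(nb, block, d)
    vb = v.reshape(nb, block, -1)
    dv = vb.shape[-1]
    fq, fk = phi(qb), phi(kb)
    # Per-key-block summaries, computed once: cost linear in sequence length.
    kv = np.einsum("jbd,jbe->jde", fk, vb)  # (nb, d, dv)
    ksum = fk.sum(axis=1)                   # (nb, d)
    out = np.zeros((nb, block, dv))
    for i in range(nb):
        crit = np.flatnonzero(labels[i] == 1)
        marg = np.flatnonzero(labels[i] == 0)
        if crit.size:
            # Exact softmax attention restricted to the critical key blocks.
            kc = kb[crit].reshape(-1, d)
            vc = vb[crit].reshape(-1, dv)
            out[i] += softmax(qb[i] @ kc.T / np.sqrt(d)) @ vc
        if marg.size:
            # Linear attention over marginal blocks via the cached summaries.
            num = fq[i] @ kv[marg].sum(axis=0)
            den = fq[i] @ ksum[marg].sum(axis=0)
            out[i] += num / den[:, None]
        # Negligible blocks (label -1) are skipped entirely.
    return out.reshape(n, dv)

rng = np.random.default_rng(0)
n, d, block = 32, 8, 4
q, k, v = rng.standard_normal((3, n, d))
labels = classify_blocks(q, k, block, top_frac=0.25, bottom_frac=0.5)
out = sparse_linear_attention(q, k, v, labels, block)
```

When every block is labeled critical, the sketch reduces to ordinary softmax attention, which is a convenient sanity check. The quoted figures are also consistent with each other: keeping 5% of the attention computation corresponds to a 1 / (1 - 0.95) = 20x reduction.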

Cite

Text

Zhang et al. "SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse–Linear Attention." International Conference on Learning Representations, 2026.

Markdown

[Zhang et al. "SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse–Linear Attention." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/zhang2026iclr-sla/)

BibTeX

@inproceedings{zhang2026iclr-sla,
  title     = {{SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse–Linear Attention}},
  author    = {Zhang, Jintao and Wang, Haoxu and Jiang, Kai and Yang, Shuo and Zheng, Kaiwen and Xi, Haocheng and Wang, Ziteng and Zhu, Hongzhou and Zhao, Min and Stoica, Ion and Gonzalez, Joseph E. and Chen, Jianfei and Zhu, Jun},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/zhang2026iclr-sla/}
}