DSA: Efficient Inference for Video Generation Models via Distributed Sparse Attention

Abstract

Diffusion Transformer models have driven rapid advances in video generation, achieving state-of-the-art quality and flexibility. However, their attention mechanism remains a major performance bottleneck, because its dense computation scales quadratically with sequence length. To overcome this limitation and reduce generation latency, we propose DSA, a novel attention mechanism that integrates sparse attention with distributed inference for diffusion-based video generation. By leveraging carefully designed parallelism strategies and scheduling, DSA significantly reduces redundant computation while preserving global context. Extensive experiments on benchmark datasets demonstrate that, when deployed on 8 GPUs, DSA achieves up to 1.43× faster inference than the existing distributed method and 10.79× faster inference than single-GPU inference.

Cite

Text

Li et al. "DSA: Efficient Inference for Video Generation Models via Distributed Sparse Attention." International Conference on Learning Representations, 2026.

Markdown

[Li et al. "DSA: Efficient Inference for Video Generation Models via Distributed Sparse Attention." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/li2026iclr-dsa/)

BibTeX

@inproceedings{li2026iclr-dsa,
  title     = {{DSA: Efficient Inference for Video Generation Models via Distributed Sparse Attention}},
  author    = {Li, Shenggui and Lu, Runyu and Chen, Qiaoling and Yin, Haiyan and Lyu, Yueming and Wen, Yonggang and Tsang, Ivor and Zhang, Tianwei},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/li2026iclr-dsa/}
}