VORTA: Efficient Video Diffusion via Routing Sparse Attention
Abstract
Video diffusion transformers have achieved remarkable progress in high-quality video generation, but remain computationally expensive due to the quadratic complexity of attention over high-dimensional video sequences. Recent acceleration methods improve efficiency by exploiting the local sparsity of attention scores, yet they often struggle to accelerate long-range computation. To address this problem, we propose VORTA, an acceleration framework with two novel components: 1) a sparse attention mechanism that efficiently captures long-range dependencies, and 2) a routing strategy that adaptively replaces full 3D attention with specialized sparse attention variants. VORTA achieves an end-to-end speedup of $1.76\times$ without loss of quality on VBench. Furthermore, it integrates seamlessly with other acceleration methods, such as model caching and step distillation, reaching a speedup of up to $14.41\times$ with negligible performance degradation. VORTA demonstrates its efficiency and enhances the practicality of video diffusion transformers in real-world settings. Code and weights are available at https://github.com/wenhao728/VORTA.
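The abstract does not spell out the routing mechanism, but the core idea of dispatching attention heads among a full variant and cheaper sparse variants can be sketched in a few lines. The sketch below is illustrative only and is not the paper's implementation: the variant set (full, local-window, and strided attention), the `window` and `stride` parameters, and the per-head soft-blend router are all assumptions chosen for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def local_mask(n: int, window: int, device) -> torch.Tensor:
    """True where |i - j| <= window: local (sparse) attention."""
    idx = torch.arange(n, device=device)
    return (idx[:, None] - idx[None, :]).abs() <= window


def strided_mask(n: int, stride: int, device) -> torch.Tensor:
    """True where (i - j) is a multiple of `stride`: cheap long-range links."""
    idx = torch.arange(n, device=device)
    return (idx[:, None] - idx[None, :]) % stride == 0


class RoutedAttention(nn.Module):
    """Hypothetical routed attention: each head blends full, local,
    and strided variants via learned per-head routing weights."""

    def __init__(self, dim: int, num_heads: int, window: int = 16, stride: int = 16):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.window, self.stride = window, stride
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # One logit per (head, variant): full, local, strided.
        self.route_logits = nn.Parameter(torch.zeros(num_heads, 3))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        qkv = self.qkv(x).view(b, n, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4).unbind(0)  # each: (b, h, n, hd)

        masks = [
            None,                                   # full 3D attention
            local_mask(n, self.window, x.device),   # local sparse variant
            strided_mask(n, self.stride, x.device), # long-range sparse variant
        ]
        outs = torch.stack(
            [F.scaled_dot_product_attention(q, k, v, attn_mask=m) for m in masks]
        )  # (3, b, h, n, hd)

        # Soft blend keeps this sketch differentiable; at inference each
        # head would be dispatched to its argmax variant so that only one
        # (usually sparse) kernel actually runs.
        w = F.softmax(self.route_logits, dim=-1).T  # (3, h)
        out = (w[:, None, :, None, None] * outs).sum(0)  # (b, h, n, hd)
        return self.proj(out.transpose(1, 2).reshape(b, n, d))


# Usage: x = torch.randn(2, 128, 256); y = RoutedAttention(256, 8)(x)
```

The speedup in such a scheme comes from the inference-time dispatch: heads routed to a sparse variant skip the quadratic full-attention computation entirely, while the few heads that genuinely need global context keep it.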
Cite
Text
Sun et al. "VORTA: Efficient Video Diffusion via Routing Sparse Attention." Advances in Neural Information Processing Systems, 2025.
Markdown
[Sun et al. "VORTA: Efficient Video Diffusion via Routing Sparse Attention." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/sun2025neurips-vorta/)
BibTeX
@inproceedings{sun2025neurips-vorta,
  title     = {{VORTA: Efficient Video Diffusion via Routing Sparse Attention}},
  author    = {Sun, Wenhao and Tu, Rong-Cheng and Ding, Yifu and Liao, Jingyi and Jin, Zhao and Liu, Shunyu and Tao, Dacheng},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/sun2025neurips-vorta/}
}