FPSAttention: Training-Aware FP8 and Sparsity Co-Design for Fast Video Diffusion
Abstract
Diffusion generative models have become the standard for producing high-quality, coherent video content, yet their slow inference speeds and high computational demands hinder practical deployment. Although quantization and sparsity can each independently accelerate inference while maintaining generation quality, naively combining them in existing training-free approaches leads to significant performance degradation, as they fail to achieve proper joint optimization. We introduce FPSAttention, a novel training-aware co-design of FP8 quantization and Sparsity for video generation, with a focus on the 3D bi-directional attention mechanism. Our approach features three key innovations: 1) A unified 3D tile-wise granularity that simultaneously supports both quantization and sparsity. 2) A denoising step-aware strategy that adapts to the noise schedule, addressing the strong correlation between quantization/sparsity errors and denoising steps. 3) A native, hardware-friendly kernel that leverages FlashAttention and is implemented with optimized Hopper architecture features, enabling highly efficient execution. Trained on Wan2.1's 1.3B and 14B models and evaluated on the VBench benchmark, FPSAttention achieves a 7.09$\times$ kernel speedup for attention operations and a 4.96$\times$ end-to-end speedup for video generation compared to the BF16 baseline at 720p resolution—without sacrificing generation quality.
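The core idea of co-designing quantization and sparsity at a shared tile granularity can be illustrated with a toy NumPy sketch. The snippet below is not the paper's Hopper kernel: it emulates e4m3 FP8 by clipping to the format's range and rounding the mantissa to roughly 3 bits, applies one quantization scale per tile, and then keeps only the top-scoring key tiles per query tile before the softmax. All function names, the tile size, and the `keep_tiles` budget are illustrative assumptions, not the paper's API.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in e4m3

def fake_fp8(x):
    """Emulate e4m3 precision (assumption: clip + ~3-bit mantissa rounding).
    np.frexp returns a mantissa in [0.5, 1), which we round to multiples of 1/16."""
    x = np.clip(x, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    m, e = np.frexp(x)
    return np.ldexp(np.round(m * 16) / 16, e)

def quantize_tilewise(x, tile):
    """Per-tile FP8 simulation: each block of `tile` rows shares one scale
    chosen from its absolute maximum; returns the dequantized tensor."""
    out = np.empty_like(x)
    for s in range(0, x.shape[0], tile):
        blk = x[s:s + tile]
        scale = np.abs(blk).max() / FP8_E4M3_MAX + 1e-12
        out[s:s + tile] = fake_fp8(blk / scale) * scale
    return out

def fps_attention_sketch(q, k, v, tile=4, keep_tiles=2):
    """Toy joint FP8 + tile-sparse attention on 2-D (seq, dim) inputs.
    Sequence length must be a multiple of `tile` in this sketch."""
    q8, k8 = quantize_tilewise(q, tile), quantize_tilewise(k, tile)
    scores = q8 @ k8.T / np.sqrt(q.shape[-1])
    nq, nk = q.shape[0] // tile, k.shape[0] // tile
    # Pool scores per (query tile, key tile) pair and keep the top key tiles.
    pooled = scores.reshape(nq, tile, nk, tile).mean(axis=(1, 3))
    keep = np.argsort(pooled, axis=-1)[:, -keep_tiles:]
    mask = np.full_like(scores, -np.inf)
    for qt in range(nq):
        for kt in keep[qt]:
            mask[qt * tile:(qt + 1) * tile, kt * tile:(kt + 1) * tile] = 0.0
    scores = scores + mask
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
out = fps_attention_sketch(q, k, v)
print(out.shape)  # (16, 8)
```

Because quantization scales and the sparsity mask are both defined on the same tiles, a fused kernel can skip a pruned tile's dequantization and matmul entirely, which is the co-design benefit the abstract describes.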
Cite
Text
Liu et al. "FPSAttention: Training-Aware FP8 and Sparsity Co-Design for Fast Video Diffusion." Advances in Neural Information Processing Systems, 2025.
BibTeX
@inproceedings{liu2025neurips-fpsattention,
title = {{FPSAttention: Training-Aware FP8 and Sparsity Co-Design for Fast Video Diffusion}},
author = {Liu, Akide and Zhang, Zeyu and Li, Zhexin and Bai, Xuehai and Xing, Yuanjie and Han, Yizeng and Tang, Jiasheng and Wu, Jichao and Yang, Mingyang and Chen, Weihua and He, Jiahao and He, Yuanyu and Wang, Fan and Haffari, Gholamreza and Zhuang, Bohan},
booktitle = {Advances in Neural Information Processing Systems},
year = {2025},
url = {https://mlanthology.org/neurips/2025/liu2025neurips-fpsattention/}
}