FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-Precision
Abstract
Attention, as a core layer of the ubiquitous Transformer architecture, is the bottleneck for large language models and long-context applications. FlashAttention elaborated an approach to speed up attention on GPUs through minimizing memory reads/writes. However, it has yet to take advantage of new capabilities present in recent hardware, with FlashAttention-2 achieving only 35% utilization on the H100 GPU. We develop three main techniques to speed up attention on Hopper GPUs: exploiting asynchrony of the Tensor Cores and TMA to (1) overlap overall computation and data movement via warp-specialization and (2) interleave block-wise matmul and softmax operations, and (3) block quantization and incoherent processing that leverages hardware support for FP8 low-precision. We demonstrate that our method, FlashAttention-3, achieves speedup on H100 GPUs by 1.5-2.0$\times$ with BF16, reaching up to 840 TFLOPs/s (85\% utilization), and with FP8 reaching 1.3 PFLOPs/s. We validate that FP8 FlashAttention-3 achieves 2.6$\times$ lower numerical error than a baseline FP8 attention.
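The sketch below illustrates, in plain PyTorch, the two FP8 accuracy ideas named in the abstract: block quantization (one scale per block rather than per tensor) and incoherent processing (rotating Q and K by a shared random orthogonal matrix so outliers are spread out while Q K^T is preserved). It is a minimal conceptual sketch, not the paper's fused CUDA kernels; the helper `block_scales`, the constant `E4M3_MAX`, and the specific tensor shapes are illustrative assumptions.

```python
import torch

E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3

def block_scales(x: torch.Tensor, block_size: int = 64) -> torch.Tensor:
    """Per-block FP8 scales: each block of rows gets its own scale, so a single
    outlier only inflates the quantization step of its own block (illustrative helper)."""
    n, d = x.shape
    return x.view(n // block_size, block_size, d).abs().amax(dim=(1, 2)) / E4M3_MAX

torch.manual_seed(0)
seqlen, head_dim = 256, 64
q = torch.randn(seqlen, head_dim)
k = torch.randn(seqlen, head_dim)
k[3, 7] = 80.0  # inject an outlier feature that would dominate a shared scale

# Incoherent processing: rotate Q and K with the same random orthogonal matrix M.
# Because M @ M.T = I, the attention scores Q K^T are mathematically unchanged,
# while the outlier's energy is spread across all head dimensions.
m, _ = torch.linalg.qr(torch.randn(head_dim, head_dim))
q_rot, k_rot = q @ m, k @ m

print(torch.allclose(q_rot @ k_rot.T, q @ k.T, atol=1e-3))  # scores preserved
print(block_scales(k).max(), block_scales(k_rot).max())     # max scale shrinks after rotation
```

Smaller per-block scales mean smaller quantization error when the rotated, block-scaled tensors are cast to FP8 inside the kernel; the rotation is free in the sense that it does not change the exact attention scores.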
Cite
Text
Shah et al. "FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-Precision." Neural Information Processing Systems, 2024. doi:10.52202/079017-2193
Markdown
[Shah et al. "FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-Precision." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/shah2024neurips-flashattention3/) doi:10.52202/079017-2193
BibTeX
@inproceedings{shah2024neurips-flashattention3,
title = {{FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-Precision}},
author = {Shah, Jay and Bikshandi, Ganesh and Zhang, Ying and Thakkar, Vijay and Ramani, Pradeep and Dao, Tri},
booktitle = {Neural Information Processing Systems},
year = {2024},
doi = {10.52202/079017-2193},
url = {https://mlanthology.org/neurips/2024/shah2024neurips-flashattention3/}
}