Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention

Abstract

The pursuit of computational efficiency has driven the adoption of low-precision formats for training transformer models. However, this progress is often hindered by notorious training instabilities. This paper provides the first mechanistic explanation for a long-standing and unresolved failure case where training with flash attention in low-precision settings leads to catastrophic loss explosion. Our in-depth analysis reveals that the failure is not a random artifact but is caused by two intertwined phenomena: the emergence of similar low-rank representations within the attention mechanism and the compounding effect of biased rounding errors inherent in low-precision arithmetic. We demonstrate how these factors create a vicious cycle of error accumulation that corrupts weight updates, ultimately derailing the training dynamics. To validate our findings, we introduce a minimal modification to flash attention that mitigates the bias in rounding errors. This simple change stabilizes the training process, confirming our analysis and offering a practical solution to this persistent problem. Code is available at https://github.com/ucker/why-low-precision-training-fails.
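As a generic illustration of how rounding errors in low-precision arithmetic can be biased rather than cancelling out, the sketch below sums many small positive values into a running total that is rounded after every addition. It is not the paper's flash attention analysis or its fix: NumPy's `float16` is assumed here only as a convenient stand-in for a low-precision format, and the helper `accumulate` and the chosen values are hypothetical.

```python
import numpy as np


def accumulate(values, dtype):
    """Sum values one at a time, rounding the running total to `dtype` after each add."""
    total = dtype(0.0)
    for v in values:
        total = dtype(total + dtype(v))
    return float(total)


# Start from 1.0, then add 10,000 increments of 1e-4 (exact total is about 2.0).
values = np.concatenate(([1.0], np.full(10_000, 1e-4)))

# Assumption: float16 stands in for "low precision"; float64 is the reference.
low_precision_sum = accumulate(values, np.float16)
reference_sum = accumulate(values, np.float64)

# Near 1.0 the float16 spacing is 2**-10, so each increment is smaller than
# half a spacing and every addition rounds back down to the same total.
print("float16 running sum:", low_precision_sum)
print("float64 running sum:", reference_sum)
print("accumulated error  :", low_precision_sum - reference_sum)
```

Every rounding step in the `float16` run errs in the same direction, so the errors add up instead of averaging out: the low-precision total never moves from 1.0 while the reference total reaches roughly 2.0. This is only a toy case of one-signed rounding error, shown to make the notion of bias concrete.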

Cite

Text

Qiu and Yao. "Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention." International Conference on Learning Representations, 2026.

Markdown

[Qiu and Yao. "Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/qiu2026iclr-lowprecision/)

BibTeX

@inproceedings{qiu2026iclr-lowprecision,
  title     = {{Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention}},
  author    = {Qiu, Haiquan and Yao, Quanming},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/qiu2026iclr-lowprecision/}
}