Faster Diffusion Through Temporal Attention Decomposition

Abstract

We explore the role of the attention mechanism during inference in text-conditional diffusion models. Empirical observations suggest that cross-attention outputs converge to a fixed point after several inference steps. The convergence time naturally divides the entire inference process into two phases: an initial phase for planning text-oriented visual semantics, which are then translated into images in a subsequent fidelity-improving phase. Cross-attention is essential in the initial phase but almost irrelevant thereafter. Self-attention, however, initially plays a minor role but becomes increasingly important in the second phase. These findings yield a simple and training-free method called TGATE, which efficiently generates images by caching and reusing attention outputs at scheduled time steps. Experiments show TGATE’s broad applicability to various existing text-conditional diffusion models, which it speeds up by 10-50%. The code of TGATE is available at https://github.com/HaozheLiu-ST/T-GATE.
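
The caching idea described above can be illustrated with a minimal sketch. The snippet below is not the authors' implementation (see the linked repository for that); it assumes a PyTorch-style cross-attention module, and the names CachedCrossAttention and gate_step are illustrative. Before the gate step, cross-attention is computed and cached; afterwards, the cached output is reused instead of being recomputed.

import torch
import torch.nn as nn

class CachedCrossAttention(nn.Module):
    """Wraps a cross-attention block; after `gate_step` denoising steps,
    it reuses the cached output instead of recomputing cross-attention."""

    def __init__(self, attn: nn.Module, gate_step: int):
        super().__init__()
        self.attn = attn          # the original cross-attention block
        self.gate_step = gate_step
        self.cache = None         # cached cross-attention output

    def forward(self, hidden_states, encoder_hidden_states, step: int):
        if step < self.gate_step or self.cache is None:
            # Initial (semantics-planning) phase: compute and cache.
            out = self.attn(hidden_states, encoder_hidden_states)
            self.cache = out.detach()
            return out
        # Fidelity-improving phase: skip recomputation, reuse the cache.
        return self.cache

# In a sampling loop, each denoising step would pass its step index, so the
# text-conditioned cross-attention is only evaluated while step < gate_step.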

Cite

Text

Liu et al. "Faster Diffusion Through Temporal Attention Decomposition." Transactions on Machine Learning Research, 2025.

Markdown

[Liu et al. "Faster Diffusion Through Temporal Attention Decomposition." Transactions on Machine Learning Research, 2025.](https://mlanthology.org/tmlr/2025/liu2025tmlr-faster/)

BibTeX

@article{liu2025tmlr-faster,
  title     = {{Faster Diffusion Through Temporal Attention Decomposition}},
  author    = {Liu, Haozhe and Zhang, Wentian and Xie, Jinheng and Faccio, Francesco and Xu, Mengmeng and Xiang, Tao and Shou, Mike Zheng and Perez-Rua, Juan-Manuel and Schmidhuber, Jürgen},
  journal   = {Transactions on Machine Learning Research},
  year      = {2025},
  url       = {https://mlanthology.org/tmlr/2025/liu2025tmlr-faster/}
}