Taming Transformer Without Using Learning Rate Warmup

Abstract

Scaling Transformer to a large scale without technical tricks such as learning rate warmup and a markedly lower learning rate is an extremely challenging task, and is gaining increasing attention. In this paper, we provide a theoretical analysis of the process of training Transformer and reveal a key problem behind the model-crash phenomenon in the training process, termed *spectral energy concentration* of ${W_q}^{\top} W_k$, which is the reason for a malignant entropy collapse, where ${W_q}$ and $W_k$ are the projection matrices for the query and the key in Transformer, respectively. To remedy this problem, motivated by *Weyl's Inequality*, we present a novel optimization strategy, i.e., keeping the weight updates in successive steps steady---if the ratio $\frac{\sigma_{1}(\nabla W_t)}{\sigma_{1}(W_{t-1})}$ is larger than a threshold, we automatically bound the learning rate to a weighted multiple of $\frac{\sigma_{1}(W_{t-1})}{\sigma_{1}(\nabla W_t)}$, where $\nabla W_t$ is the update quantity at step $t$. Such an optimization strategy can prevent spectral energy from concentrating into only a few directions, and thus can avoid the malignant entropy collapse that triggers the model crash. We conduct extensive experiments using ViT, Swin-Transformer and GPT, showing that our optimization strategy can effectively and stably train these Transformer models without using learning rate warmup.
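The abstract describes the update rule only at a high level. Below is a minimal NumPy sketch of the idea as we read it: cap the effective learning rate when the spectral norm of the update is large relative to that of the current weight. The function name `bounded_lr_update` and the hyperparameters `threshold` and `alpha` are hypothetical illustrations, not the paper's exact algorithm or values.

```python
import numpy as np

def bounded_lr_update(W_prev, grad_W, base_lr, threshold=0.1, alpha=0.5):
    """Hypothetical sketch of a spectrally bounded weight update.

    If sigma_1(grad_W) / sigma_1(W_prev) exceeds `threshold`, the effective
    learning rate is capped at alpha * sigma_1(W_prev) / sigma_1(grad_W),
    so a single step cannot overwhelm the leading singular direction of W
    (the intuition drawn from Weyl's inequality in the abstract).
    """
    sigma_w = np.linalg.norm(W_prev, 2)   # largest singular value of W_{t-1}
    sigma_g = np.linalg.norm(grad_W, 2)   # largest singular value of the update
    ratio = sigma_g / (sigma_w + 1e-12)
    if ratio > threshold:
        lr = min(base_lr, alpha * sigma_w / (sigma_g + 1e-12))
    else:
        lr = base_lr
    return W_prev - lr * grad_W

# Toy usage on a random query-projection matrix
rng = np.random.default_rng(0)
W_q = 0.02 * rng.standard_normal((64, 64))
grad = rng.standard_normal((64, 64))
W_q_next = bounded_lr_update(W_q, grad, base_lr=1e-3)
```

In this reading, the cap only activates when an update would change the top singular value of the weight disproportionately, which is consistent with the stated goal of keeping successive updates steady.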

Cite

Text

Qi et al. "Taming Transformer Without Using Learning Rate Warmup." International Conference on Learning Representations, 2025.

Markdown

[Qi et al. "Taming Transformer Without Using Learning Rate Warmup." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/qi2025iclr-taming/)

BibTeX

@inproceedings{qi2025iclr-taming,
  title     = {{Taming Transformer Without Using Learning Rate Warmup}},
  author    = {Qi, Xianbiao and He, Yelin and Ye, Jiaquan and Li, Chun-Guang and Zi, Bojia and Dai, Xili and Zou, Qin and Xiao, Rong},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/qi2025iclr-taming/}
}