The Sharpness Disparity Principle in Transformers for Accelerating Language Model Pre-Training

Abstract

Transformers have become the cornerstone of modern AI. Unlike traditional architectures, transformers exhibit a distinctive characteristic: diverse types of building blocks, such as embedding layers, normalization layers, self-attention mechanisms, and point-wise feed-forward networks, work in concert. Understanding the disparities and interactions among these blocks is therefore important. In this paper, we uncover a clear sharpness disparity across these blocks, which intriguingly emerges early in training and persists throughout. Building on this insight, we propose a novel Blockwise Learning Rate (LR) strategy to accelerate large language model (LLM) pre-training. Specifically, by integrating Blockwise LR into AdamW, we consistently achieve lower terminal loss and a nearly $2\times$ speedup compared to vanilla AdamW. This improvement is demonstrated across GPT-2 and LLaMA models, with model sizes ranging from 0.12B to 1.1B parameters and datasets including OpenWebText and MiniPile. Finally, we incorporate Blockwise LR into Adam-mini (Zhang et al., 2024), a recently proposed memory-efficient variant of Adam, achieving a combined $2\times$ speedup and $2\times$ memory savings. These results underscore the potential of leveraging the sharpness disparity principle to improve LLM training.
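
Below is a minimal sketch of how a blockwise learning rate can be combined with AdamW via per-group learning rates, assuming a PyTorch GPT-style model whose parameter names contain block-type substrings such as "embed", "norm", "attn", and "mlp". The block keys and multipliers are illustrative placeholders, not the paper's tuned values or its exact grouping rule.

```python
# Sketch: per-block-type learning rates with AdamW parameter groups.
# Assumptions (not from the paper): block types are identified by substrings
# in parameter names, and the LR multipliers below are placeholders.
import torch
from torch import nn


def blockwise_param_groups(model: nn.Module, base_lr: float,
                           multipliers: dict) -> list:
    """Split parameters by block type; each group gets base_lr * multiplier."""
    groups = {key: [] for key in multipliers}
    groups["other"] = []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        for key in multipliers:
            if key in name:
                groups[key].append(param)
                break
        else:
            groups["other"].append(param)
    return [
        {"params": params, "lr": base_lr * multipliers.get(key, 1.0)}
        for key, params in groups.items() if params
    ]


# Hypothetical usage with a user-provided transformer and placeholder multipliers.
# model = MyTransformer()
# param_groups = blockwise_param_groups(
#     model, base_lr=3e-4,
#     multipliers={"embed": 1.0, "norm": 1.0, "attn": 1.0, "mlp": 1.0},
# )
# optimizer = torch.optim.AdamW(param_groups, betas=(0.9, 0.95), weight_decay=0.1)
```

The same grouping can be passed to other optimizers that accept parameter groups (e.g., memory-efficient Adam variants), which is how a blockwise LR could be layered onto Adam-mini in principle.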

Cite

Text

Wang et al. "The Sharpness Disparity Principle in Transformers for Accelerating Language Model Pre-Training." Proceedings of the 42nd International Conference on Machine Learning, 2025.

Markdown

[Wang et al. "The Sharpness Disparity Principle in Transformers for Accelerating Language Model Pre-Training." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/wang2025icml-sharpness/)

BibTeX

@inproceedings{wang2025icml-sharpness,
  title     = {{The Sharpness Disparity Principle in Transformers for Accelerating Language Model Pre-Training}},
  author    = {Wang, Jinbo and Wang, Mingze and Zhou, Zhanpeng and Yan, Junchi and E, Weinan and Wu, Lei},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025},
  pages     = {64859--64879},
  volume    = {267},
  url       = {https://mlanthology.org/icml/2025/wang2025icml-sharpness/}
}