Taming Curvature: Architecture Warm-up for Stable Transformer Training

Abstract

Training billion-parameter Transformers is often brittle, with transient loss spikes and divergence that waste compute. Even though the recently developed Edge of Stability (EoS) theory provides a powerful tool to understand and control the stability of optimization methods via the (preconditioned) curvature, these curvature-controlling methods are not popular in large-scale Transformer training due to the complexity of curvature estimation. To this end, we first introduce a fast online estimator of the largest (preconditioned) Hessian eigenvalue (i.e., curvature) based on a warm-started variant of power iteration with Hessian–vector products. We show theoretically, and verify empirically, that the proposed method makes per-iteration curvature tracking feasible at billion-parameter scale while being more accurate. Using this tool, we find that training instabilities coincide with surges in preconditioned curvature and that curvature grows with depth. Motivated by these observations, we propose architecture warm-up: progressively growing network depth to carefully control the preconditioned Hessian and stabilize training. Experiments on large Transformers validate that our approach enables efficient curvature tracking and reduces instabilities compared to existing state-of-the-art stabilization techniques without slowing down convergence.
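
The general idea of warm-started power iteration can be illustrated with a minimal sketch. This is not the authors' estimator: it assumes a toy 2x2 matrix standing in for the Hessian (no preconditioning, no autodiff), and the names `rotating_hessian`, `hvp`, `power_iteration`, and `track` are hypothetical. It only shows why reusing the previous step's eigenvector estimate can let a single Hessian–vector product per training step track the top eigenvalue when the curvature changes slowly.

```python
import math


def rotating_hessian(step, drift=0.01):
    """Toy 2x2 'Hessian' (assumption): eigenvalues 5 and 4, eigenvectors rotating slowly."""
    c, s = math.cos(drift * step), math.sin(drift * step)
    # R diag(5, 4) R^T written out explicitly for R = [[c, -s], [s, c]].
    return [[5 * c * c + 4 * s * s, c * s], [c * s, 5 * s * s + 4 * c * c]]


def hvp(hessian, v):
    """Hessian-vector product; an explicit matrix stands in for autodiff here."""
    return [sum(h * x for h, x in zip(row, v)) for row in hessian]


def power_iteration(hessian, v, num_iters=1):
    """Run power iteration from v; return (Rayleigh quotient, unit vector)."""
    for _ in range(num_iters):
        hv = hvp(hessian, v)
        norm = math.sqrt(sum(x * x for x in hv))
        v = [x / norm for x in hv]
    hv = hvp(hessian, v)
    return sum(a * b for a, b in zip(v, hv)), v


def track(num_steps=200, warm_start=True):
    """Estimate the top eigenvalue at every step with one iteration per step."""
    v = [1.0, 0.0]
    estimate = None
    for step in range(num_steps):
        if not warm_start:
            v = [1.0, 0.0]  # cold start: discard the previous estimate
        estimate, v = power_iteration(rotating_hessian(step), v)
    return estimate


warm_estimate = track(warm_start=True)
cold_estimate = track(warm_start=False)
print("true top eigenvalue: 5.0")
print("warm-started estimate:", warm_estimate)
print("cold-started estimate:", cold_estimate)
```

In this toy setting the warm-started estimate stays close to the true value of 5, while restarting from a fixed vector at every step leaves a visibly larger error for the same per-step cost.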

Cite

Text

Ramasinghe et al. "Taming Curvature: Architecture Warm-up for Stable Transformer Training." International Conference on Learning Representations, 2026.

Markdown

[Ramasinghe et al. "Taming Curvature: Architecture Warm-up for Stable Transformer Training." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/ramasinghe2026iclr-taming/)

BibTeX

@inproceedings{ramasinghe2026iclr-taming,
  title     = {{Taming Curvature: Architecture Warm-up for Stable Transformer Training}},
  author    = {Ramasinghe, Sameera and Ajanthan, Thalaiyasingam and Dolatabadi, Hadi Mohaghegh and Koneputugodage, Chamin P Hewa and Avraham, Gil and Shevchenko, Violetta and Zuo, Yan and Pajak, Karol and Long, Alexander},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/ramasinghe2026iclr-taming/}
}