Convex Dominance in Deep Learning I: A Scaling Law of Loss and Learning Rate

Abstract

Deep learning has a non-convex loss landscape, and its optimization dynamics are hard to analyze or control. Nevertheless, the dynamics can be empirically convex-like across various tasks, models, optimizers, hyperparameters, etc. In this work, we examine the applicability of convexity and Lipschitz continuity in deep learning, in order to precisely control the loss dynamics via learning rate schedules. We illustrate that deep learning quickly becomes weakly convex after a short period of training, and that the loss is predictable by an upper bound on the last iterate, which further informs the scaling of the optimal learning rate. Through the lens of convexity, we build scaling laws of learning rates and losses that extrapolate by as much as $80\times$ across training horizons and $70\times$ across model sizes.
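
To illustrate how a convex upper bound can inform learning rate scaling, here is a minimal sketch based on the textbook bound for subgradient descent on a convex, Lipschitz objective with a constant learning rate. It is not the last-iterate bound used in the paper, which the abstract does not state; it bounds the averaged iterate instead. The function names and the symbols `dist` (initial distance to a minimizer) and `lipschitz` (Lipschitz constant) are assumptions introduced for illustration.

```python
import math


def avg_iterate_bound(lr: float, steps: int, dist: float, lipschitz: float) -> float:
    """Textbook upper bound on the suboptimality of the averaged iterate.

    For a convex, `lipschitz`-Lipschitz objective, subgradient descent with
    constant learning rate `lr` for `steps` iterations satisfies
        f(x_avg) - f* <= dist**2 / (2 * lr * steps) + lr * lipschitz**2 / 2.
    This is an illustrative stand-in, not the paper's last-iterate bound.
    """
    return dist**2 / (2 * lr * steps) + lr * lipschitz**2 / 2


def optimal_lr(steps: int, dist: float, lipschitz: float) -> float:
    """Learning rate that minimizes the bound above: dist / (lipschitz * sqrt(steps))."""
    return dist / (lipschitz * math.sqrt(steps))


if __name__ == "__main__":
    # Hypothetical constants, chosen only to make the scaling visible.
    dist, lipschitz = 1.0, 1.0
    for steps in (100, 400, 1600):
        lr = optimal_lr(steps, dist, lipschitz)
        bound = avg_iterate_bound(lr, steps, dist, lipschitz)
        print(f"steps={steps:5d}  optimal lr={lr:.4f}  bound={bound:.4f}")
```

Under this simplified bound, both the minimizing learning rate and the resulting bound shrink proportionally to the inverse square root of the training horizon, which is the kind of horizon-dependent scaling a convexity-based analysis can suggest.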

Cite

Text

Bu et al. "Convex Dominance in Deep Learning I: A Scaling Law of Loss and Learning Rate." International Conference on Learning Representations, 2026.

Markdown

[Bu et al. "Convex Dominance in Deep Learning I: A Scaling Law of Loss and Learning Rate." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/bu2026iclr-convex/)

BibTeX

@inproceedings{bu2026iclr-convex,
  title     = {{Convex Dominance in Deep Learning I: A Scaling Law of Loss and Learning Rate}},
  author    = {Bu, Zhiqi and Xu, Shiyun and Mao, Jialin},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/bu2026iclr-convex/}
}