Where Do Large Learning Rates Lead Us? A Feature Learning Perspective

Abstract

It is conventional wisdom that using large learning rates (LRs) early in training improves generalization. Following a line of research devoted to understanding this effect mechanistically, we conduct an empirical study in a controlled setting, focusing on the feature learning properties of training with different initial LRs. We show that the range of initial LRs yielding the best generalization of the final solution produces a sparse set of learned features, clearly focused on those most relevant to the task. In contrast, training that starts with too small an LR attempts to learn all features simultaneously, resulting in poor generalization. Conversely, initial LRs that are too large fail to extract meaningful patterns from the data.
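To make the kind of comparison described above concrete, here is a minimal, hedged PyTorch sketch: it trains a toy two-layer network on synthetic data with different initial LRs, then decays to a small LR, and reports a crude sparsity proxy (the fraction of hidden units that never activate). Every specific here, the synthetic task, network width, epoch counts, LR grid, and the dead-unit sparsity measure, is an illustrative assumption, not the paper's actual setup or code.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic 20-dimensional data where only the first two features
# determine the label (assumption: a toy stand-in for "relevant features").
n, d = 2048, 20
X = torch.randn(n, d)
y = (X[:, 0] + X[:, 1] > 0).long()

def dead_unit_fraction(model):
    # Crude sparsity proxy: share of hidden ReLU units that are never
    # active on the training data (max activation is exactly zero).
    with torch.no_grad():
        h = model[:2](X)  # activations after the first Linear + ReLU
        return (h.max(dim=0).values <= 0).float().mean().item()

def train(init_lr, init_epochs=200, finetune_lr=1e-3, finetune_epochs=200):
    model = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, 2))
    loss_fn = nn.CrossEntropyLoss()
    # Two-phase schedule: start at init_lr, then decay to a small LR,
    # mimicking "large LR early, small LR late" at a cartoon level.
    for lr, epochs in [(init_lr, init_epochs), (finetune_lr, finetune_epochs)]:
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        for _ in range(epochs):
            opt.zero_grad()
            loss_fn(model(X), y).backward()
            opt.step()
    return model

for lr in (1e-3, 1e-1, 5.0):
    model = train(lr)
    with torch.no_grad():
        acc = (model(X).argmax(dim=1) == y).float().mean().item()
    # Note: the largest LR may diverge on this toy problem, which is
    # the illustrative point for the "too large" regime.
    print(f"initial LR {lr:g}: train acc = {acc:.2f}, "
          f"inactive hidden units = {dead_unit_fraction(model):.0%}")
```

In this sketch the middle LR typically ends up with more inactive hidden units than the smallest one while still fitting the data, loosely echoing the "sparse features with good generalization" regime; how faithfully this toy tracks the paper's controlled experiments is, of course, an open assumption.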

Cite

Text

Sadrtdinov et al. "Where Do Large Learning Rates Lead Us? A Feature Learning Perspective." ICML 2024 Workshops: HiLD, 2024.

Markdown

[Sadrtdinov et al. "Where Do Large Learning Rates Lead Us? A Feature Learning Perspective." ICML 2024 Workshops: HiLD, 2024.](https://mlanthology.org/icmlw/2024/sadrtdinov2024icmlw-large/)

BibTeX

@inproceedings{sadrtdinov2024icmlw-large,
  title     = {{Where Do Large Learning Rates Lead Us? A Feature Learning Perspective}},
  author    = {Sadrtdinov, Ildus and Kodryan, Maxim and Pokonechny, Eduard and Lobacheva, Ekaterina and Vetrov, Dmitry},
  booktitle = {ICML 2024 Workshops: HiLD},
  year      = {2024},
  url       = {https://mlanthology.org/icmlw/2024/sadrtdinov2024icmlw-large/}
}