Late-Phase Second-Order Training

Abstract

Towards the end of training, stochastic first-order methods such as SGD and Adam go into diffusion and no longer make significant progress. In contrast, Newton-type methods are highly efficient "close" to the optimum, at least in the deterministic case. These methods might therefore turn out to be particularly efficient tools for the final phase of training in the stochastic deep learning context as well. In this work, we study this idea by empirically comparing a second-order Hessian-free optimizer against several first-order strategies with learning-rate decay schedules for late-phase training. We show that performing a few costly but precise second-order steps can outperform the first-order alternatives in wall-clock runtime.
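To make the "costly but precise second-order steps" concrete, here is a minimal sketch (in PyTorch, not the authors' implementation) of a Hessian-free Newton-type update: the Newton direction is obtained by conjugate gradient using only Hessian-vector products, so the Hessian is never formed explicitly. The function name, its parameters (`cg_iters`, `damping`, `lr`), and the fixed damping are illustrative assumptions.

```python
# Illustrative sketch of one damped, Hessian-free Newton-type step.
# Solves (H + damping * I) d = -g approximately with conjugate gradient,
# where H-vector products come from the Pearlmutter trick.
import torch


def hessian_free_step(loss_fn, params, cg_iters=10, damping=1e-3, lr=1.0):
    """Apply one damped Newton step to `params` (tensors with requires_grad=True)."""
    loss = loss_fn()
    grads = torch.autograd.grad(loss, params, create_graph=True)
    g = torch.cat([gr.reshape(-1) for gr in grads])

    def hvp(v):
        # Hessian-vector product: differentiate the scalar g . v w.r.t. params.
        hv = torch.autograd.grad(g @ v, params, retain_graph=True)
        return torch.cat([h.reshape(-1) for h in hv]) + damping * v

    # Conjugate gradient for (H + damping * I) d = -g, starting from d = 0.
    d = torch.zeros_like(g)
    r = -g.detach().clone()
    p = r.clone()
    rs_old = r @ r
    for _ in range(cg_iters):
        Hp = hvp(p)
        alpha = rs_old / (p @ Hp)
        d = d + alpha * p
        r = r - alpha * Hp
        rs_new = r @ r
        if rs_new.sqrt() < 1e-8:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new

    # Apply the update d = -(H + damping * I)^{-1} g, scaled by lr.
    with torch.no_grad():
        offset = 0
        for prm in params:
            n = prm.numel()
            prm.add_(lr * d[offset:offset + n].view_as(prm))
            offset += n
    return loss.detach()
```

A single such step is far more expensive than an SGD or Adam step (each CG iteration costs roughly one extra backward pass), which is why the comparison in the paper is made in wall-clock runtime rather than in number of steps.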

Cite

Text

Tatzel et al. "Late-Phase Second-Order Training." NeurIPS 2022 Workshops: HITY, 2022.

Markdown

[Tatzel et al. "Late-Phase Second-Order Training." NeurIPS 2022 Workshops: HITY, 2022.](https://mlanthology.org/neuripsw/2022/tatzel2022neuripsw-latephase/)

BibTeX

@inproceedings{tatzel2022neuripsw-latephase,
  title     = {{Late-Phase Second-Order Training}},
  author    = {Tatzel, Lukas and Hennig, Philipp and Schneider, Frank},
  booktitle = {NeurIPS 2022 Workshops: HITY},
  year      = {2022},
  url       = {https://mlanthology.org/neuripsw/2022/tatzel2022neuripsw-latephase/}
}