Late-Phase Second-Order Training
Abstract
Towards the end of training, stochastic first-order methods such as SGD and Adam go into diffusion and no longer make significant progress. In contrast, Newton-type methods are highly efficient "close" to the optimum, in the deterministic case. Therefore, these methods might turn out to be a particularly efficient tool for the final phase of training in the stochastic deep learning context as well. In our work, we study this idea by conducting an empirical comparison of a second-order Hessian-free optimizer and different first-order strategies with learning rate decays for late-phase training. We show that performing a few costly but precise second-order steps can outperform first-order alternatives in wall-clock runtime.
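The "second-order Hessian-free optimizer" mentioned in the abstract refers to Newton-type steps computed without ever forming the Hessian explicitly: the Newton system is solved approximately with conjugate gradients, using only Hessian-vector products obtained via double backpropagation. Below is a minimal PyTorch sketch of one such damped Newton-CG step, intended purely as an illustration of the general technique and not as the authors' implementation; the function name `hessian_free_step`, the damping value, the CG iteration budget, and the assumption of a single flat parameter tensor are all illustrative choices.

```python
# Minimal sketch (not the paper's code): one damped Newton-CG step that
# solves (H + damping*I) p = -g with conjugate gradients, where H-vector
# products come from differentiating the gradient a second time.
import torch

def hessian_free_step(loss_fn, params, cg_iters=50, damping=1e-3, tol=1e-6):
    """params: a single flat tensor with requires_grad=True (simplifying assumption)."""
    loss = loss_fn(params)
    grad, = torch.autograd.grad(loss, params, create_graph=True)

    def hvp(v):
        # Hessian-vector product: differentiate the scalar grad·v w.r.t. params.
        hv, = torch.autograd.grad(grad @ v, params, retain_graph=True)
        return hv + damping * v  # Tikhonov damping for robustness

    # Conjugate gradients on the damped Newton system.
    p = torch.zeros_like(grad)
    r = -grad.detach().clone()
    d = r.clone()
    rs_old = r @ r
    for _ in range(cg_iters):
        Hd = hvp(d)
        alpha = rs_old / (d @ Hd)
        p = p + alpha * d
        r = r - alpha * Hd
        rs_new = r @ r
        if rs_new.sqrt() < tol:
            break
        d = r + (rs_new / rs_old) * d
        rs_old = rs_new

    with torch.no_grad():
        params += p  # apply the approximate Newton step
    return loss.item()
```

In a late-phase setting of the kind the paper studies, a few such steps would be run after an initial stretch of first-order training; each step is far more expensive than an SGD or Adam update, but the abstract's claim is that a small number of these precise steps can still win in wall-clock time.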
Cite
Text
Tatzel et al. "Late-Phase Second-Order Training." NeurIPS 2022 Workshops: HITY, 2022.
Markdown
[Tatzel et al. "Late-Phase Second-Order Training." NeurIPS 2022 Workshops: HITY, 2022.](https://mlanthology.org/neuripsw/2022/tatzel2022neuripsw-latephase/)
BibTeX
@inproceedings{tatzel2022neuripsw-latephase,
  title     = {{Late-Phase Second-Order Training}},
  author    = {Tatzel, Lukas and Hennig, Philipp and Schneider, Frank},
  booktitle = {NeurIPS 2022 Workshops: HITY},
  year      = {2022},
  url       = {https://mlanthology.org/neuripsw/2022/tatzel2022neuripsw-latephase/}
}