On Optimization Methods for Deep Learning
Abstract
The predominant methodology for training deep learning models advocates the use of stochastic gradient descent methods (SGDs). Despite their ease of implementation, SGDs are difficult to tune and parallelize. These problems make it challenging to develop, debug, and scale up deep learning algorithms with SGDs. In this paper, we show that more sophisticated off-the-shelf optimization methods such as limited-memory BFGS (L-BFGS) and conjugate gradient (CG) with line search can significantly simplify and speed up the process of pretraining deep algorithms. In our experiments, the differences between L-BFGS/CG and SGDs are more pronounced if we consider algorithmic extensions (e.g., sparsity regularization) and hardware extensions (e.g., GPUs or computer clusters). Our experiments with distributed optimization support the use of L-BFGS with locally connected networks and convolutional neural networks. Using L-BFGS, our convolutional network model achieves 0.69% error on the standard MNIST dataset. This is a state-of-the-art result on MNIST among algorithms that do not use distortions or pretraining.
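To make the abstract's contrast concrete, the sketch below compares hand-tuned SGD with off-the-shelf L-BFGS on a tiny logistic-regression problem. This is an illustrative assumption, not the paper's experimental setup: the dataset is synthetic, and SciPy's `L-BFGS-B` stands in for the batch L-BFGS with line search the paper discusses.

```python
# Illustrative only: SGD needs a manually tuned learning rate and many
# small steps, while L-BFGS with line search works "off the shelf".
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = rng.normal(size=5)
y = (X @ true_w > 0).astype(float)  # synthetic binary labels

def loss_and_grad(w):
    """Logistic loss and its gradient (batch objective for L-BFGS)."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))  # sigmoid predictions
    loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    grad = X.T @ (p - y) / len(y)
    return loss, grad

# SGD: one example at a time, step size chosen by hand.
w_sgd = np.zeros(5)
lr = 0.5  # hand-tuned; too large diverges, too small is slow
for epoch in range(100):
    for i in rng.permutation(len(y)):
        p_i = 1.0 / (1.0 + np.exp(-(X[i] @ w_sgd)))
        w_sgd -= lr * (p_i - y[i]) * X[i]

# L-BFGS: the line search picks step sizes automatically.
res = minimize(loss_and_grad, np.zeros(5), jac=True, method="L-BFGS-B")
```

Both optimizers reach a low loss here; the practical difference the paper emphasizes is that L-BFGS required no learning-rate tuning and uses large batches, which parallelize more naturally.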
Cite

Text:
Le et al. "On Optimization Methods for Deep Learning." International Conference on Machine Learning, 2011.

Markdown:
[Le et al. "On Optimization Methods for Deep Learning." International Conference on Machine Learning, 2011.](https://mlanthology.org/icml/2011/le2011icml-optimization/)

BibTeX:
@inproceedings{le2011icml-optimization,
title = {{On Optimization Methods for Deep Learning}},
author = {Le, Quoc V. and Ngiam, Jiquan and Coates, Adam and Lahiri, Abhik and Prochnow, Bobby and Ng, Andrew Y.},
booktitle = {International Conference on Machine Learning},
year = {2011},
pages = {265--272},
url = {https://mlanthology.org/icml/2011/le2011icml-optimization/}
}