Understanding Gradient Clipping in Incremental Gradient Methods

Abstract

We provide a theoretical analysis of how gradient clipping affects the convergence of incremental gradient methods when minimizing an objective function that is the sum of a large number of component functions. We show that clipping the gradients of component functions introduces a bias in the descent direction, which depends on the clipping threshold, the norms of the component gradients, and the angles between the component gradients and the full gradient. We then propose sufficient conditions under which incremental gradient methods with gradient clipping converge under the more general relaxed smoothness assumption. We also empirically observe that the angles between the component gradients and the full gradient generally decrease as the batch size increases, which may help explain why larger batch sizes generally lead to faster convergence when training deep neural networks with gradient clipping.
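To make the setting concrete, below is a minimal sketch (not the authors' code) of an incremental gradient method with per-component clip-by-norm. The function names, the threshold parameter, and the cyclic component order are illustrative assumptions; the point is that the sum of clipped component gradients need not align with the full gradient, which is the source of the bias discussed in the abstract.

import numpy as np

def clip(g, c):
    # Clip a component gradient g to have norm at most c (clip-by-norm).
    norm = np.linalg.norm(g)
    return g if norm <= c else (c / norm) * g

def incremental_gd_with_clipping(grads, x0, step_size=0.1, threshold=1.0, epochs=10):
    # grads: list of callables; grads[i](x) returns the gradient of component f_i at x.
    # Cycles through components one at a time, stepping along each clipped gradient.
    x = x0.copy()
    for _ in range(epochs):
        for grad_f_i in grads:
            g = clip(grad_f_i(x), threshold)
            x = x - step_size * g
    return x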

Cite

Text

Qian et al. "Understanding Gradient Clipping in Incremental Gradient Methods." Artificial Intelligence and Statistics, 2021.

Markdown

[Qian et al. "Understanding Gradient Clipping in Incremental Gradient Methods." Artificial Intelligence and Statistics, 2021.](https://mlanthology.org/aistats/2021/qian2021aistats-understanding/)

BibTeX

@inproceedings{qian2021aistats-understanding,
  title     = {{Understanding Gradient Clipping in Incremental Gradient Methods}},
  author    = {Qian, Jiang and Wu, Yuren and Zhuang, Bojin and Wang, Shaojun and Xiao, Jing},
  booktitle = {Artificial Intelligence and Statistics},
  year      = {2021},
  pages     = {1504--1512},
  volume    = {130},
  url       = {https://mlanthology.org/aistats/2021/qian2021aistats-understanding/}
}