The Implicit Bias of Gradient Descent on Nonseparable Data

Abstract

Gradient descent, when applied to the task of logistic regression, outputs iterates which are biased to follow a unique ray defined by the data. The direction of this ray is the maximum margin predictor of a maximal linearly separable subset of the data; the gradient descent iterates converge to this ray in direction at the rate $\mathcal{O}(\ln\ln t / \ln t)$. The ray does not pass through the origin in general, and its offset is the bounded global optimum of the risk over the remaining data; gradient descent recovers this offset at a rate $\mathcal{O}((\ln t)^2 / \sqrt{t})$.
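
The following is a minimal sketch, not from the paper, of the phenomenon the abstract describes: full-batch gradient descent on logistic regression over a toy dataset that is only partly linearly separable. The dataset, step size, and iteration counts are invented for illustration (assuming NumPy); the printed iterate norm keeps growing while the normalized direction stabilizes, consistent with the ray behavior stated above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2D data: two well-separated clusters plus overlapping points
# that make the full dataset nonseparable. Labels are in {-1, +1}.
X_sep = np.vstack([rng.normal([+3.0, 0.0], 0.3, size=(20, 2)),
                   rng.normal([-3.0, 0.0], 0.3, size=(20, 2))])
y_sep = np.concatenate([np.ones(20), -np.ones(20)])
X_mix = rng.normal([0.0, 0.0], 0.5, size=(20, 2))   # overlapping points
y_mix = rng.choice([-1.0, 1.0], size=20)
X = np.vstack([X_sep, X_mix])
y = np.concatenate([y_sep, y_mix])

def risk_grad(w):
    # Gradient of the empirical logistic risk (1/n) sum_i log(1 + exp(-y_i x_i.w)).
    margins = np.clip(y * (X @ w), -500, 500)          # avoid overflow in exp
    return -(X.T @ (y / (1.0 + np.exp(margins)))) / len(y)

w = np.zeros(2)
eta = 0.5                                              # illustrative step size
for t in range(1, 100_001):
    w -= eta * risk_grad(w)
    if t in (10, 100, 1_000, 10_000, 100_000):
        print(t, np.linalg.norm(w), w / np.linalg.norm(w))
# The norm of w diverges (slowly) while the printed direction converges,
# matching the directional convergence described in the abstract.
```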

Cite

Text

Ji and Telgarsky. "The Implicit Bias of Gradient Descent on Nonseparable Data." Conference on Learning Theory, 2019.

Markdown

[Ji and Telgarsky. "The Implicit Bias of Gradient Descent on Nonseparable Data." Conference on Learning Theory, 2019.](https://mlanthology.org/colt/2019/ji2019colt-implicit/)

BibTeX

@inproceedings{ji2019colt-implicit,
  title     = {{The Implicit Bias of Gradient Descent on Nonseparable Data}},
  author    = {Ji, Ziwei and Telgarsky, Matus},
  booktitle = {Conference on Learning Theory},
  year      = {2019},
  pages     = {1772--1798},
  volume    = {99},
  url       = {https://mlanthology.org/colt/2019/ji2019colt-implicit/}
}