Transformers Learn to Achieve Second-Order Convergence Rates for In-Context Linear Regression

Abstract

Transformers excel at *in-context learning* (ICL)---learning from demonstrations without parameter updates---but how they do so remains a mystery. Recent work suggests that Transformers may internally run Gradient Descent (GD), a first-order optimization method, to perform ICL. In this paper, we instead demonstrate that Transformers learn to approximate second-order optimization methods for ICL. For in-context linear regression, Transformers converge at a rate similar to that of *Iterative Newton's Method*, and both are *exponentially* faster than GD. Empirically, predictions from successive Transformer layers closely match successive iterations of Newton's Method, with the correspondence scaling linearly in depth: each middle layer roughly computes 3 iterations, so Transformers and Newton's Method converge at roughly the same rate. In contrast, Gradient Descent converges exponentially more slowly. We also show that Transformers can learn in-context on ill-conditioned data, a setting where Gradient Descent struggles but Iterative Newton succeeds. Finally, to corroborate our empirical findings, we prove that Transformers can implement $k$ iterations of Newton's method with $k + \mathcal{O}(1)$ layers.
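
To illustrate the rate gap the abstract describes, below is a minimal NumPy sketch contrasting the Newton-Schulz form of Iterative Newton's Method (second-order) with plain gradient descent (first-order) on a least-squares problem. The function names, the initialization $M_0 = \alpha S$ with $\alpha = 1/\|S\|_2^2$, and the toy data are illustrative assumptions for exposition, not the paper's experimental setup.

```python
import numpy as np


def newton_iterates(X, y, num_iters):
    """Iterative Newton (Newton-Schulz) for least squares.

    Approximates S^{-1}, where S = X^T X, via M_{j+1} = 2 M_j - M_j S M_j,
    reading out the weight estimate w_j = M_j X^T y after each step.
    """
    S = X.T @ X
    alpha = 1.0 / np.linalg.norm(S, 2) ** 2   # small enough to guarantee convergence (assumed choice)
    M = alpha * S                             # M_0 = alpha * S^T (S is symmetric)
    iterates = []
    for _ in range(num_iters):
        M = 2 * M - M @ S @ M                 # quadratic (second-order) convergence
        iterates.append(M @ X.T @ y)
    return iterates


def gd_iterates(X, y, num_iters):
    """Plain gradient descent on the least-squares objective, for comparison."""
    S, b = X.T @ X, X.T @ y
    lr = 1.0 / np.linalg.norm(S, 2)           # step size bounded by 1 / largest eigenvalue
    w = np.zeros(X.shape[1])
    iterates = []
    for _ in range(num_iters):
        w = w - lr * (S @ w - b)              # linear (first-order) convergence
        iterates.append(w.copy())
    return iterates


# Toy comparison: Newton's error shrinks doubly exponentially, GD's only geometrically.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 10))
y = X @ rng.normal(size=10)
w_ls = np.linalg.lstsq(X, y, rcond=None)[0]
for j, (wn, wg) in enumerate(zip(newton_iterates(X, y, 10), gd_iterates(X, y, 10)), start=1):
    print(f"iter {j:2d}  Newton err {np.linalg.norm(wn - w_ls):.2e}  GD err {np.linalg.norm(wg - w_ls):.2e}")
```

In this sketch, each Newton-Schulz step corresponds to one "iteration" in the paper's sense; the abstract's claim is that roughly three such iterations are emulated per middle Transformer layer, and that $k$ iterations fit in $k + \mathcal{O}(1)$ layers.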

Cite

Text

Fu et al. "Transformers Learn to Achieve Second-Order Convergence Rates for In-Context Linear Regression." Neural Information Processing Systems, 2024. doi:10.52202/079017-3132

Markdown

[Fu et al. "Transformers Learn to Achieve Second-Order Convergence Rates for In-Context Linear Regression." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/fu2024neurips-transformers/) doi:10.52202/079017-3132

BibTeX

@inproceedings{fu2024neurips-transformers,
  title     = {{Transformers Learn to Achieve Second-Order Convergence Rates for In-Context Linear Regression}},
  author    = {Fu, Deqing and Chen, Tian-Qi and Jia, Robin and Sharan, Vatsal},
  booktitle = {Neural Information Processing Systems},
  year      = {2024},
  doi       = {10.52202/079017-3132},
  url       = {https://mlanthology.org/neurips/2024/fu2024neurips-transformers/}
}