Transformers Implement Functional Gradient Descent to Learn Non-Linear Functions in Context
Abstract
Many neural network architectures are known to be Turing complete and can thus, in principle, implement arbitrary algorithms. However, Transformers are unique in that they can implement gradient-based learning algorithms under simple parameter configurations. This paper provides theoretical and empirical evidence that (non-linear) Transformers naturally learn to implement gradient descent in function space, which in turn enables them to learn non-linear functions in context. Our results apply to a broad class of combinations of non-linear architectures and non-linear in-context learning tasks. Additionally, we show that the optimal choice of non-linear activation depends in a natural way on the class of functions that need to be learned.
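To make the central claim concrete, the following is a minimal sketch of gradient descent in function space (kernel gradient descent) on in-context examples, which is the kind of algorithm the paper argues Transformers implement. It assumes a squared loss and an exponential kernel purely for illustration; the function and variable names are not from the paper.

```python
import numpy as np

def kernel(x, z, gamma=1.0):
    """Exponential (attention-like) kernel; one illustrative choice among many."""
    return np.exp(gamma * np.dot(x, z))

def functional_gd_step(f_ctx, f_query, xs, ys, x_query, lr=0.1):
    """One step of functional gradient descent on the squared loss
    L(f) = 0.5 * sum_i (f(x_i) - y_i)^2, tracked through its values
    at the in-context points and at the query point."""
    residuals = ys - f_ctx  # negative functional gradient coefficients
    K_ctx = np.array([[kernel(xi, xj) for xj in xs] for xi in xs])
    k_qry = np.array([kernel(x_query, xj) for xj in xs])
    # f_{t+1}(x) = f_t(x) + lr * sum_i (y_i - f_t(x_i)) K(x, x_i)
    return f_ctx + lr * K_ctx @ residuals, f_query + lr * k_qry @ residuals

# Hypothetical usage: a few steps on a non-linear in-context task.
rng = np.random.default_rng(0)
xs = rng.normal(size=(16, 4))
ys = np.sin(xs @ rng.normal(size=4))      # non-linear target
x_query = rng.normal(size=4)
f_ctx, f_query = np.zeros(16), 0.0
for _ in range(10):
    f_ctx, f_query = functional_gd_step(f_ctx, f_query, xs, ys, x_query)
```

Each iteration moves the current function estimate along the negative functional gradient, weighting every in-context residual by its kernel similarity to the point being predicted; the paper's choice-of-activation result corresponds to matching this kernel to the target function class.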
Cite
Text
Cheng et al. "Transformers Implement Functional Gradient Descent to Learn Non-Linear Functions in Context." International Conference on Machine Learning, 2024.
Markdown
[Cheng et al. "Transformers Implement Functional Gradient Descent to Learn Non-Linear Functions in Context." International Conference on Machine Learning, 2024.](https://mlanthology.org/icml/2024/cheng2024icml-transformers/)
BibTeX
@inproceedings{cheng2024icml-transformers,
title = {{Transformers Implement Functional Gradient Descent to Learn Non-Linear Functions in Context}},
author = {Cheng, Xiang and Chen, Yuxin and Sra, Suvrit},
booktitle = {International Conference on Machine Learning},
year = {2024},
pages = {8002--8037},
volume = {235},
url = {https://mlanthology.org/icml/2024/cheng2024icml-transformers/}
}