Transformers Are Universal Predictors

Abstract

We find limits to the Transformer architecture for language modeling and show that it has a universal prediction property in an information-theoretic sense. We further analyze its performance in non-asymptotic data regimes to understand the role of various components of the Transformer architecture, especially in the context of data-efficient training. We validate our theoretical analysis with experiments on both synthetic and real datasets.

Cite

Text

Basu et al. "Transformers Are Universal Predictors." ICML 2023 Workshops: NCW, 2023.

Markdown

[Basu et al. "Transformers Are Universal Predictors." ICML 2023 Workshops: NCW, 2023.](https://mlanthology.org/icmlw/2023/basu2023icmlw-transformers/)

BibTeX

@inproceedings{basu2023icmlw-transformers,
  title     = {{Transformers Are Universal Predictors}},
  author    = {Basu, Sourya and Choraria, Moulik and Varshney, Lav R.},
  booktitle = {ICML 2023 Workshops: NCW},
  year      = {2023},
  url       = {https://mlanthology.org/icmlw/2023/basu2023icmlw-transformers/}
}