Transformers Are Universal Predictors
Abstract
We find limits to the Transformer architecture for language modeling and show it has a universal prediction property in an information-theoretic sense. We further analyze its performance in non-asymptotic data regimes to understand the role of various components of the Transformer architecture, especially in the context of data-efficient training. We validate our theoretical analysis with experiments on both synthetic and real datasets.
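As a point of reference (a standard formalization of universal prediction, not necessarily the paper's exact definition), a predictor $q$ is called universal with respect to a class $\mathcal{P}$ of sources if its per-symbol cumulative log-loss regret vanishes as the sequence length grows:

$$
\frac{1}{n}\left(\sum_{t=1}^{n} -\log q\!\left(x_t \mid x^{t-1}\right) \;-\; \min_{p \in \mathcal{P}} \sum_{t=1}^{n} -\log p\!\left(x_t \mid x^{t-1}\right)\right) \;\longrightarrow\; 0 ,
$$

i.e., the predictor asymptotically matches the best source in the class in terms of average log-loss.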
Cite
Text
Basu et al. "Transformers Are Universal Predictors." ICML 2023 Workshops: NCW, 2023.
Markdown
[Basu et al. "Transformers Are Universal Predictors." ICML 2023 Workshops: NCW, 2023.](https://mlanthology.org/icmlw/2023/basu2023icmlw-transformers/)
BibTeX
@inproceedings{basu2023icmlw-transformers,
  title = {{Transformers Are Universal Predictors}},
  author = {Basu, Sourya and Choraria, Moulik and Varshney, Lav R.},
  booktitle = {ICML 2023 Workshops: NCW},
  year = {2023},
  url = {https://mlanthology.org/icmlw/2023/basu2023icmlw-transformers/}
}