Transformers on Markov Data: Constant Depth Suffices
Abstract
Attention-based transformers have been remarkably successful at modeling generative processes across various domains and modalities. In this paper, we study the behavior of transformers on data drawn from $k^{\text{th}}$-order Markov processes, where the conditional distribution of the next symbol in a sequence depends on the previous $k$ symbols observed. Empirically, we observe a surprising phenomenon that contradicts previous findings: when trained for sufficiently long, a transformer with fixed depth and $1$ head per layer achieves low test loss on sequences drawn from $k^{\text{th}}$-order Markov sources, even as $k$ grows. Furthermore, this low test loss is achieved by the transformer’s ability to represent and learn the in-context conditional empirical distribution. On the theoretical side, we prove that a transformer with $O(\log_2(k))$ layers can represent the in-context conditional empirical distribution by composing induction heads to track the previous $k$ symbols in the sequence. Surprisingly, with the addition of layer normalization, we show that a transformer with a constant number of layers can represent the in-context conditional empirical distribution, concurring with our empirical observations. This result provides further insight into the benefit of soft attention and non-linearities in the transformer architecture.
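To make the central object concrete: given a sequence $x_1, \dots, x_n$ and its most recent context $x_{n-k+1}^{n}$, the in-context conditional empirical distribution assigns to each symbol $j$ the fraction of earlier occurrences of that same length-$k$ context in the sequence that were followed by $j$. Below is a minimal sketch (not from the paper) of sampling from a $k^{\text{th}}$-order Markov source and computing this estimator; the function names `sample_kth_order_markov` and `empirical_conditional`, as well as the add-one smoothing, are illustrative assumptions.

```python
import numpy as np

def sample_kth_order_markov(kernel, k, length, vocab_size, rng):
    """Sample a sequence from a k-th order Markov source.

    `kernel` maps each length-k context (a tuple of symbols) to a
    probability vector over the vocabulary.
    """
    seq = [int(s) for s in rng.integers(vocab_size, size=k)]  # arbitrary initial context
    while len(seq) < length:
        context = tuple(seq[-k:])
        seq.append(int(rng.choice(vocab_size, p=kernel[context])))
    return seq

def empirical_conditional(seq, k, vocab_size):
    """In-context conditional empirical distribution of the next symbol
    given the last k symbols, estimated from the sequence itself
    (add-one smoothing keeps it well-defined for unseen contexts)."""
    context = tuple(seq[-k:])
    counts = np.ones(vocab_size)
    for t in range(k, len(seq)):
        if tuple(seq[t - k:t]) == context:
            counts[seq[t]] += 1
    return counts / counts.sum()

# Usage: a random binary 2nd-order source and a length-512 sequence.
rng = np.random.default_rng(0)
k, vocab_size = 2, 2
contexts = [(a, b) for a in range(vocab_size) for b in range(vocab_size)]
kernel = {ctx: rng.dirichlet(np.ones(vocab_size)) for ctx in contexts}
seq = sample_kth_order_markov(kernel, k, length=512, vocab_size=vocab_size, rng=rng)
print(empirical_conditional(seq, k, vocab_size))  # approaches kernel[tuple(seq[-k:])] for long sequences
```

For long sequences this estimator converges to the true transition kernel of the source, which is why a transformer that represents it in context attains low test loss.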
Cite
Text
Rajaraman et al. "Transformers on Markov Data: Constant Depth Suffices." Neural Information Processing Systems, 2024. doi:10.52202/079017-4369
Markdown
[Rajaraman et al. "Transformers on Markov Data: Constant Depth Suffices." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/rajaraman2024neurips-transformers/) doi:10.52202/079017-4369
BibTeX
@inproceedings{rajaraman2024neurips-transformers,
title = {{Transformers on Markov Data: Constant Depth Suffices}},
author = {Rajaraman, Nived and Bondaschi, Marco and Ramchandran, Kannan and Gastpar, Michael and Makkuva, Ashok Vardhan},
booktitle = {Neural Information Processing Systems},
year = {2024},
doi = {10.52202/079017-4369},
url = {https://mlanthology.org/neurips/2024/rajaraman2024neurips-transformers/}
}