Mechanism and Emergence of Stacked Attention Heads in Multi-Layer Transformers

Abstract

In this paper, I introduce the retrieval problem, a simple yet common reasoning task that can be solved only by transformers with a minimum number of layers, which grows logarithmically with the input size. I empirically show that large language models can solve the task under different prompting formulations without any fine-tuning. To understand how transformers solve the retrieval problem, I train several transformers on a minimal formulation. Successful learning occurs only in the presence of an implicit curriculum. I uncover the learned mechanisms by studying the attention maps in the trained transformers. I also study the training process, finding that attention heads always emerge in a specific sequence guided by the implicit curriculum.
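
The abstract does not spell out the retrieval problem's exact formulation, so the sketch below is only a rough illustration of the kind of multi-hop retrieval task it describes: a chain of key-value lookups hidden among distractor pairs, where the answer requires composing several lookups in sequence. The function name, token encoding, and chain/distractor layout are illustrative assumptions, not the paper's construction.

import random

def make_retrieval_example(chain_len=4, num_distractors=4, seed=0):
    """Build a toy pointer-chasing instance (illustrative assumption only):
    a chain x0 -> x1 -> ... -> xN is hidden among shuffled (key, value)
    pairs, and the query asks where the chain starting at x0 ends.
    Answering requires composing chain_len lookups; per the abstract, the
    minimum transformer depth for such tasks grows logarithmically with
    the input size."""
    rng = random.Random(seed)
    symbols = rng.sample(range(1, 1000), chain_len + 1 + 2 * num_distractors)
    chain, rest = symbols[:chain_len + 1], symbols[chain_len + 1:]
    # Consecutive chain elements form the relevant (key, value) pairs.
    pairs = [(chain[i], chain[i + 1]) for i in range(chain_len)]
    # Distractor pairs use unrelated symbols.
    pairs += [(rest[2 * i], rest[2 * i + 1]) for i in range(num_distractors)]
    rng.shuffle(pairs)
    # Flatten the pairs into a token sequence and append the query.
    prompt = [tok for pair in pairs for tok in pair] + ["?", chain[0]]
    return prompt, chain[-1]

if __name__ == "__main__":
    prompt, answer = make_retrieval_example()
    print(prompt, "->", answer)

Running the script prints one shuffled prompt together with the end of the hidden chain, which is the target a model would be trained to predict under this assumed formulation.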

Cite

Text

Mușat. "Mechanism and Emergence of Stacked Attention Heads in Multi-Layer Transformers." International Conference on Learning Representations, 2025.

Markdown

[Mușat. "Mechanism and Emergence of Stacked Attention Heads in Multi-Layer Transformers." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/musat2025iclr-mechanism/)

BibTeX

@inproceedings{musat2025iclr-mechanism,
  title     = {{Mechanism and Emergence of Stacked Attention Heads in Multi-Layer Transformers}},
  author    = {Mușat, Tiberiu},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/musat2025iclr-mechanism/}
}