Transformers Learn Latent Mixture Models In-Context via Mirror Descent

Abstract

Sequence modelling requires determining which past tokens in the context are causally relevant and how important each one is. This process is inherent to the attention layers in transformers, yet the mechanisms they learn to carry it out remain poorly understood. In this work, we formalize the task of estimating token importance as an in-context learning problem by introducing a framework based on Mixture of Transition Distributions, where a latent variable determines the influence of past tokens on the next. The distribution over this latent variable is parameterized by unobserved mixture weights that transformers must learn in-context. We demonstrate that transformers can implement Mirror Descent to learn these weights from the context. Specifically, we give an explicit construction of a three-layer transformer that exactly implements one step of Mirror Descent and prove that the resulting estimator is a first-order approximation of the Bayes-optimal predictor. Corroborating our construction and its learnability via gradient descent, we empirically show that transformers trained from scratch learn solutions consistent with our theory: their predictive distributions, attention patterns, and learned transition matrix closely match the construction, while deeper models achieve performance comparable to multi-step Mirror Descent.
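To make the idea of learning mixture weights by Mirror Descent concrete, the following is a minimal sketch and not the authors' transformer construction. It assumes the standard Mixture of Transition Distributions form, in which the next token is drawn from `sum_k lam[k] * P[x_{t-k}][x_t]` with a single shared transition matrix `P`, and it assumes the negative-entropy mirror map, under which a Mirror Descent step on the simplex reduces to a multiplicative (exponentiated-gradient) update. The function names, the step size `eta`, the uniform initialization, and the toy data are all hypothetical choices made for illustration.

```python
import math


def mtd_nll_gradient(tokens, P, lam):
    """Gradient of the average negative log-likelihood w.r.t. the weights lam.

    Assumed model: p(x_t | past) = sum_k lam[k] * P[x_{t-k-1}][x_t], k = 0..K-1.
    """
    K = len(lam)
    grad = [0.0] * K
    n = 0
    for t in range(K, len(tokens)):
        # Probability of the observed token under each lag taken alone.
        per_lag = [P[tokens[t - k - 1]][tokens[t]] for k in range(K)]
        # Probability of the observed token under the current mixture.
        mix = sum(l * p for l, p in zip(lam, per_lag))
        for k in range(K):
            grad[k] -= per_lag[k] / mix
        n += 1
    return [g / n for g in grad]


def mirror_descent_step(lam, grad, eta):
    """One entropic Mirror Descent step: lam_k * exp(-eta * g_k), renormalized."""
    logits = [math.log(l) - eta * g for l, g in zip(lam, grad)]
    m = max(logits)  # subtract the max for numerical stability
    w = [math.exp(z - m) for z in logits]
    s = sum(w)
    return [x / s for x in w]


if __name__ == "__main__":
    # Toy two-state transition matrix and context (illustrative values only).
    P = [[0.9, 0.1], [0.2, 0.8]]
    tokens = [0, 0, 0, 1, 1, 1, 0, 0, 1, 1]
    lam = [0.5, 0.5]  # uniform initialization over K = 2 lags
    lam = mirror_descent_step(lam, mtd_nll_gradient(tokens, P, lam), eta=1.0)
    print(lam)
```

The multiplicative form keeps the weights strictly positive and on the simplex without an explicit projection, which is the usual reason for preferring the entropic mirror map in this setting. Calling `mirror_descent_step` repeatedly with a fresh gradient gives a multi-step variant.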

Cite

Text

D'Angelo and Flammarion. "Transformers Learn Latent Mixture Models In-Context via Mirror Descent." International Conference on Learning Representations, 2026.

Markdown

[D'Angelo and Flammarion. "Transformers Learn Latent Mixture Models In-Context via Mirror Descent." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/dangelo2026iclr-transformers/)

BibTeX

@inproceedings{dangelo2026iclr-transformers,
  title     = {{Transformers Learn Latent Mixture Models In-Context via Mirror Descent}},
  author    = {D'Angelo, Francesco and Flammarion, Nicolas},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/dangelo2026iclr-transformers/}
}