Emergence of Meta-Stable Clustering in Mean-Field Transformer Models
Abstract
We model the evolution of tokens within a deep stack of Transformer layers as a continuous-time flow on the unit sphere, governed by a mean-field interacting particle system, building on the framework introduced in Geshkovski et al. (2023). The corresponding mean-field Partial Differential Equation (PDE) can be interpreted as a Wasserstein gradient flow. In this paper we study this PDE and provide a mathematical investigation of the long-term behavior of the system, with a particular focus on the emergence and persistence of meta-stable phases and clustering phenomena, which are key elements in applications like next-token prediction. More specifically, we perform a perturbative analysis of the mean-field PDE around the i.i.d. uniform initialization and prove that, in the limit of a large number of tokens, the model remains close to a meta-stable manifold of solutions with a given structure (e.g., periodicity). Further, we explicitly identify the structure characterizing the meta-stable manifold, as a function of the inverse temperature parameter of the model, by the index maximizing a certain rescaling of Gegenbauer polynomials.
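To make the particle picture concrete, the following is a minimal sketch of a finite-token simulation on the unit sphere, not the authors' code. It assumes the softmax-normalized self-attention dynamics with identity query, key, and value matrices, discretized by an explicit Euler step followed by renormalization; the abstract does not specify which variant of the dynamics is analyzed. All function names, parameter values, and the diagnostic `mean_pairwise_inner_product` are hypothetical choices made for illustration.

```python
import numpy as np


def project_to_tangent(x, v):
    """Remove from each row of v its component along the matching row of x."""
    return v - np.sum(x * v, axis=1, keepdims=True) * x


def attention_velocity(x, beta):
    """Velocity of each token: softmax-weighted mean of all tokens,
    projected onto the tangent space of the sphere at that token.

    Assumption: identity query/key/value matrices, inverse temperature beta.
    """
    logits = beta * (x @ x.T)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return project_to_tangent(x, weights @ x)


def simulate(n_tokens=64, dim=3, beta=4.0, step=0.05, n_steps=400, seed=0):
    """Explicit Euler integration from an i.i.d. uniform initialization.

    Returns the initial and final token configurations, each of shape
    (n_tokens, dim) with unit-norm rows.
    """
    rng = np.random.default_rng(seed)
    # Normalized Gaussian vectors are uniformly distributed on the sphere.
    x = rng.standard_normal((n_tokens, dim))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    x0 = x.copy()
    for _ in range(n_steps):
        x = x + step * attention_velocity(x, beta)
        x /= np.linalg.norm(x, axis=1, keepdims=True)  # stay on the sphere
    return x0, x


def mean_pairwise_inner_product(x):
    """Average of <x_i, x_j> over distinct pairs; a crude clustering diagnostic."""
    n = x.shape[0]
    gram = x @ x.T
    return (gram.sum() - np.trace(gram)) / (n * (n - 1))
```

Calling `simulate()` and comparing `mean_pairwise_inner_product` on the two returned arrays, or inspecting the Gram matrix of the final configuration, gives a rough view of whether the tokens have grouped into clusters. Varying `beta` is the natural experiment here, since the abstract ties the structure of the meta-stable manifold to the inverse temperature.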
Cite
Text
Bruno et al. "Emergence of Meta-Stable Clustering in Mean-Field Transformer Models." International Conference on Learning Representations, 2025.
Markdown
[Bruno et al. "Emergence of Meta-Stable Clustering in Mean-Field Transformer Models." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/bruno2025iclr-emergence/)
BibTeX
@inproceedings{bruno2025iclr-emergence,
title = {{Emergence of Meta-Stable Clustering in Mean-Field Transformer Models}},
author = {Bruno, Giuseppe and Pasqualotto, Federico and Agazzi, Andrea},
booktitle = {International Conference on Learning Representations},
year = {2025},
url = {https://mlanthology.org/iclr/2025/bruno2025iclr-emergence/}
}