Transformers Provably Learn Chain-of-Thought Reasoning with Length Generalization
Abstract
The ability to reason lies at the core of artificial intelligence (AI), and challenging problems usually call for deeper and longer reasoning. A crucial question about AI reasoning is whether models can extrapolate learned reasoning patterns to solve harder tasks with a longer chain-of-thought (CoT). In this work, we present a theoretical analysis of transformers learning synthetic state-tracking tasks with gradient descent. Specifically: 1) We prove how the *algebraic structure* of state-tracking problems governs the length generalization of learned reasoning in transformers. In doing so, we formulate the **attention concentration** mechanism, linking the retrieval robustness of the attention layer to the task structure of long-context state-tracking problems. 2) Moreover, we prove that a transformer can provably *self-improve* via a *recursive self-training* scheme that progressively extends the range of solvable problem lengths. We show that, through recursive training, the model can acquire abilities beyond the coverage of the base model, in contrast to prior theoretical works on self-improvement. To our knowledge, we provide the first *optimization guarantee* that constant-depth transformers provably learn $\text{NC}^1$-complete problems with CoT, significantly going beyond prior art confined to $\text{TC}^0$, unless the widely held conjecture $\text{TC}^0 \neq \text{NC}^1$ fails. Finally, we present a broad set of experiments supporting our theoretical results, confirming the length generalization behaviors and the mechanism of attention concentration.
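To make the setting concrete, below is a minimal sketch of one canonical $\text{NC}^1$-complete state-tracking task: word problems over the permutation group $S_5$, with the chain-of-thought spelling out the running state after each step (the prefix products). The specific task family, tokenization, and CoT format are illustrative assumptions rather than the paper's exact construction; $S_5$ word problems are the standard example of an $\text{NC}^1$-complete state-tracking problem via Barrington's theorem.

```python
# Sketch of a synthetic state-tracking task with CoT supervision.
# Assumptions: group S_5, CoT = sequence of running states; the paper's exact
# task family and data format may differ.
import itertools
import random

# Elements of S_5 as tuples: g[i] is the image of i under the permutation g.
S5 = list(itertools.permutations(range(5)))

def compose(g, h):
    """Return g∘h, where (g∘h)(i) = g(h(i))."""
    return tuple(g[h[i]] for i in range(5))

def sample_example(length, rng=random):
    """Sample a word g_1, ..., g_T over S_5 together with its chain-of-thought.

    The CoT is the sequence of running states s_t = g_t ∘ s_{t-1} (with s_0 the
    identity), and the label is the final state s_T.
    """
    word = [rng.choice(S5) for _ in range(length)]
    states, state = [], tuple(range(5))  # s_0 = identity permutation
    for g in word:
        state = compose(g, state)  # apply g_t after the current running state
        states.append(state)
    return word, states, states[-1]

# Length generalization in this style of task: train on short words,
# evaluate on strictly longer ones.
train_word, train_cot, _ = sample_example(8)
eval_word, eval_cot, _ = sample_example(64)
```

In a length-generalization experiment of this flavor, the model is trained on short words (here length 8) with the CoT as intermediate supervision and then evaluated on much longer words (here length 64), which is the regime the abstract's guarantees and the recursive self-training scheme address.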
Cite
Text
Huang et al. "Transformers Provably Learn Chain-of-Thought Reasoning with Length Generalization." Advances in Neural Information Processing Systems, 2025.
Markdown
[Huang et al. "Transformers Provably Learn Chain-of-Thought Reasoning with Length Generalization." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/huang2025neurips-transformers/)
BibTeX
@inproceedings{huang2025neurips-transformers,
title = {{Transformers Provably Learn Chain-of-Thought Reasoning with Length Generalization}},
author = {Huang, Yu and Wen, Zixin and Singh, Aarti and Chi, Yuejie and Chen, Yuxin},
booktitle = {Advances in Neural Information Processing Systems},
year = {2025},
url = {https://mlanthology.org/neurips/2025/huang2025neurips-transformers/}
}