Generalization Bound for a Shallow Transformer Trained Using Gradient Descent

Abstract

In this work, we establish a norm-based generalization bound for a shallow Transformer model trained via gradient descent under the bounded-drift (lazy training) regime, where model parameters remain close to their initialization throughout training. Our analysis proceeds in three stages: (a) we formally define a hypothesis class of Transformer models constrained to remain within a small neighborhood of their initialization; (b) we derive an upper bound on the Rademacher complexity of this class, quantifying its effective capacity; and (c) we establish an upper bound on the empirical loss achieved by gradient descent under suitable assumptions on model width, learning rate, and data structure. Combining these results, we obtain a high-probability bound on the true loss that decays sublinearly with the number of training samples $N$ and depends explicitly on model and data parameters. The resulting bound demonstrates that, in the lazy regime, wide and shallow Transformers generalize similarly to their linearized (NTK) counterparts. Empirical evaluations on both text and image datasets support the theoretical findings.
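As an illustrative sketch only (not the paper's exact statement), the three stages above fit the standard Rademacher-complexity template for a bounded loss. The symbols below are placeholder notation rather than the authors' own: $\mathcal{F}_\rho$ denotes a class of Transformers whose parameters stay within distance $\rho$ of their initialization, $f_{\mathrm{GD}}$ the gradient-descent iterate, $\widehat{\mathcal{L}}_N$ and $\mathcal{L}$ the empirical and true losses, and $\delta$ the failure probability.

% Generic uniform-convergence template (illustrative; constants assume a loss bounded in [0, 1]
% and a Lipschitz loss composed with the function class via the contraction lemma).
% Stage (c) controls the first term, stage (b) the second; the last term is standard concentration.
\begin{equation}
  \mathcal{L}(f_{\mathrm{GD}})
  \;\le\;
  \underbrace{\widehat{\mathcal{L}}_N(f_{\mathrm{GD}})}_{\text{optimization, stage (c)}}
  \;+\;
  \underbrace{2\,\mathfrak{R}_N(\mathcal{F}_\rho)}_{\text{capacity, stage (b)}}
  \;+\;
  \sqrt{\frac{\log(1/\delta)}{2N}}
  \qquad \text{with probability at least } 1-\delta,
\end{equation}
% For a norm-bounded (lazy) class one generically expects the capacity term to scale as
\begin{equation}
  \mathfrak{R}_N(\mathcal{F}_\rho) \;=\; \mathcal{O}\!\left(\frac{\rho \cdot C(\text{model, data})}{\sqrt{N}}\right).
\end{equation}

The second display only records the generic $1/\sqrt{N}$ scaling expected for a norm-bounded lazy class; the paper's actual bound makes the factor $C(\cdot)$ explicit in the model width, learning rate, and data parameters.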

Cite

Text

Mwigo and Dasgupta. "Generalization Bound for a Shallow Transformer Trained Using Gradient Descent." Transactions on Machine Learning Research, 2026.

Markdown

[Mwigo and Dasgupta. "Generalization Bound for a Shallow Transformer Trained Using Gradient Descent." Transactions on Machine Learning Research, 2026.](https://mlanthology.org/tmlr/2026/mwigo2026tmlr-generalization/)

BibTeX

@article{mwigo2026tmlr-generalization,
  title     = {{Generalization Bound for a Shallow Transformer Trained Using Gradient Descent}},
  author    = {Mwigo, Brian and Dasgupta, Anirban},
  journal   = {Transactions on Machine Learning Research},
  year      = {2026},
  url       = {https://mlanthology.org/tmlr/2026/mwigo2026tmlr-generalization/}
}