On the Generalization Ability of Next-Token-Prediction Pretraining
Abstract
Large language models (LLMs) have demonstrated remarkable potential in handling natural language processing (NLP) tasks and beyond. LLMs are typically transformer decoder-only models (DOMs) that use Next-Token-Prediction (NTP) as their pre-training objective. Despite their tremendous empirical successes, a theoretical understanding of how NTP pre-training affects the model’s generalization behavior is lacking. To fill this gap, we establish a fine-grained generalization analysis for NTP pre-training based on Rademacher complexity, in which the dependence between tokens is also addressed. Technically, a novel decomposition of Rademacher complexity is developed to study DOMs from the perspectives of the representation learner and the token predictor, respectively. Furthermore, upper bounds on the covering number are established for multi-layer, multi-head transformer-decoder models under the Frobenius norm, which theoretically pioneers the incorporation of the mask matrix within the self-attention mechanism. Our results reveal that the generalization ability of NTP pre-training is quantitatively affected by the number of token sequences $N$, the maximum sequence length $m$, and the number of parameters in the transformer model $\Theta$. Additionally, experiments on public datasets verify our theoretical findings.
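To make the setting concrete, the sketch below (a hedged illustration, not the paper's implementation) shows how the mask matrix in self-attention and the NTP objective interact for a single sequence of length $m$: each position may only attend to earlier positions, and the representation at position $t$ is scored against token $t+1$. All function names, dimensions, and the toy parameters are illustrative assumptions.

```python
# Minimal sketch (assumed, not the paper's code): causal masked self-attention
# followed by a next-token-prediction loss, in plain NumPy.
import numpy as np

def causal_self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention over a sequence X of shape (m, d).

    The strictly upper-triangular mask forbids position i from attending to
    positions j > i, which is what makes next-token prediction well posed.
    """
    m, d = X.shape
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)                      # (m, m) attention logits
    mask = np.triu(np.ones((m, m), dtype=bool), k=1)   # True strictly above the diagonal
    scores = np.where(mask, -np.inf, scores)           # mask out future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                                  # (m, d) contextual representations

def ntp_loss(logits, tokens):
    """Average next-token cross-entropy: position t predicts token t+1."""
    m = len(tokens)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -np.mean(log_probs[np.arange(m - 1), tokens[1:]])

# Toy usage: a random "representation learner" (masked attention over embeddings)
# followed by a linear "token predictor", mirroring the decomposition in the abstract.
rng = np.random.default_rng(0)
vocab, m, d = 50, 8, 16
tokens = rng.integers(vocab, size=m)
E = rng.normal(scale=0.1, size=(vocab, d))             # token embeddings
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))
W_out = rng.normal(scale=0.1, size=(d, vocab))         # token predictor
H = causal_self_attention(E[tokens], Wq, Wk, Wv)       # representation learner
print("NTP loss:", ntp_loss(H @ W_out, tokens))
```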
Cite
Text
Li et al. "On the Generalization Ability of Next-Token-Prediction Pretraining." Proceedings of the 42nd International Conference on Machine Learning, 2025.Markdown
[Li et al. "On the Generalization Ability of Next-Token-Prediction Pretraining." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/li2025icml-generalization/)BibTeX
@inproceedings{li2025icml-generalization,
title = {{On the Generalization Ability of Next-Token-Prediction Pretraining}},
author = {Li, Zhihao and Jiang, Xue and Liu, Liyuan and Zhang, Xuelin and Chen, Hong and Zheng, Feng},
booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
year = {2025},
pages = {34943--34975},
volume = {267},
url = {https://mlanthology.org/icml/2025/li2025icml-generalization/}
}