On the Generalization Ability of Next-Token-Prediction Pretraining
Abstract
Large language models (LLMs) have demonstrated remarkable potential in handling natural language processing (NLP) tasks and beyond. LLMs are typically transformer decoder-only models (DOMs) that use Next-Token-Prediction (NTP) as their pre-training objective. Despite their tremendous empirical successes, a theoretical understanding of how NTP pre-training affects the model’s generalization behavior is lacking. To fill this gap, we establish a fine-grained generalization analysis for NTP pre-training based on Rademacher complexity, in which the dependence between tokens is also addressed. Technically, a novel decomposition of Rademacher complexity is developed to study DOMs from the perspectives of the representation learner and the token predictor, respectively. Furthermore, upper bounds on the covering number are established for multi-layer, multi-head transformer-decoder models under the Frobenius norm, which theoretically pioneers the incorporation of the mask matrix within the self-attention mechanism. Our results reveal that the generalization ability of NTP pre-training is quantitatively affected by the number of token sequences $N$, the maximum sequence length $m$, and the number of parameters in the transformer model $\Theta$. Additionally, experiments on public datasets verify our theoretical findings.
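To make the setting concrete, the sketch below (a hedged illustration, not the paper's implementation) shows how the mask matrix in self-attention and the NTP objective interact for a single sequence of length $m$: each position may only attend to earlier positions, and the representation at position $t$ is scored against token $t+1$. All function names, dimensions, and the toy parameters are illustrative assumptions.

```python
# Minimal sketch (assumed, not the paper's code): causal masked self-attention
# followed by a next-token-prediction loss, in plain NumPy.
import numpy as np

def causal_self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention over a sequence X of shape (m, d).

    The strictly upper-triangular mask forbids position i from attending to
    positions j > i, which is what makes next-token prediction well posed.
    """
    m, d = X.shape
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)                      # (m, m) attention logits
    mask = np.triu(np.ones((m, m), dtype=bool), k=1)   # True strictly above the diagonal
    scores = np.where(mask, -np.inf, scores)           # mask out future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                                  # (m, d) contextual representations

def ntp_loss(logits, tokens):
    """Average next-token cross-entropy: position t predicts token t+1."""
    m = len(tokens)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -np.mean(log_probs[np.arange(m - 1), tokens[1:]])

# Toy usage: a random "representation learner" (masked attention over embeddings)
# followed by a linear "token predictor", mirroring the decomposition in the abstract.
rng = np.random.default_rng(0)
vocab, m, d = 50, 8, 16
tokens = rng.integers(vocab, size=m)
E = rng.normal(scale=0.1, size=(vocab, d))             # token embeddings
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))
W_out = rng.normal(scale=0.1, size=(d, vocab))         # token predictor
H = causal_self_attention(E[tokens], Wq, Wk, Wv)       # representation learner
print("NTP loss:", ntp_loss(H @ W_out, tokens))
```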
Cite
Text
Li et al. "On the Generalization Ability of Next-Token-Prediction Pretraining." Proceedings of the 42nd International Conference on Machine Learning, 2025.Markdown
[Li et al. "On the Generalization Ability of Next-Token-Prediction Pretraining." Proceedings of the 42nd International Conference on Machine Learning, 2025.](https://mlanthology.org/icml/2025/li2025icml-generalization/)BibTeX
@inproceedings{li2025icml-generalization,
title = {{On the Generalization Ability of Next-Token-Prediction Pretraining}},
author = {Li, Zhihao and Jiang, Xue and Liu, Liyuan and Zhang, Xuelin and Chen, Hong and Zheng, Feng},
booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
year = {2025},
pages = {34943--34975},
volume = {267},
url = {https://mlanthology.org/icml/2025/li2025icml-generalization/}
}