Beyond Next Token Prediction: Patch-Level Training for Large Language Models

Abstract

The prohibitive training costs of Large Language Models (LLMs) have emerged as a significant bottleneck in the development of next-generation LLMs. In this paper, we show that it is possible to significantly reduce the training costs of LLMs without sacrificing their performance. Specifically, we introduce patch-level training for LLMs, in which multiple tokens are aggregated into a unit of higher information density, referred to as a 'patch', to serve as the fundamental text unit for training LLMs. During patch-level training, we feed the language model shorter sequences of patches and train it to predict the next patch, thereby processing the majority of the training data at a significantly reduced cost. Following this, the model continues token-level training on the remaining training data to align with the inference mode. Experiments on a diverse range of models (370M-2.7B parameters) demonstrate that patch-level training can reduce the overall training costs to 0.5×, without compromising the model performance compared to token-level training. Source code: https://github.com/shaochenze/PatchTrain.
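
As a rough illustration of the training scheme described above, the following is a minimal PyTorch sketch, not the authors' implementation (see the linked repository for that). It assumes details the abstract does not specify: a patch embedding is taken as the mean of its K consecutive token embeddings, each patch position predicts all K tokens of the following patch with a single shared output head, and the tiny model dimensions and patch size K=4 are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical toy sizes; the paper trains 370M-2.7B parameter models.
vocab_size, d_model, K = 1000, 128, 4     # K = tokens per patch (assumed value)
B, T = 2, 32                              # batch size and token sequence length (T % K == 0)

embed = nn.Embedding(vocab_size, d_model)
layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
backbone = nn.TransformerEncoder(layer, num_layers=2)
lm_head = nn.Linear(d_model, vocab_size)

tokens = torch.randint(0, vocab_size, (B, T))

# Aggregate every K consecutive token embeddings into one patch embedding
# (mean pooling is an assumption; the abstract only says tokens are aggregated).
tok_emb = embed(tokens)                                  # (B, T, d)
patch_emb = tok_emb.view(B, T // K, K, d_model).mean(2)  # (B, P, d), P = T // K
P = T // K

# The backbone now runs over a K-times shorter, causally masked patch sequence.
causal = torch.triu(torch.ones(P, P, dtype=torch.bool), diagonal=1)
hidden = backbone(patch_emb, mask=causal)                # (B, P, d)

# Each patch position predicts all K tokens of the *next* patch with one shared head.
logits = lm_head(hidden[:, :-1])                         # (B, P-1, vocab)
targets = tokens.view(B, P, K)[:, 1:]                    # (B, P-1, K)
loss = F.cross_entropy(
    logits.unsqueeze(2).expand(-1, -1, K, -1).reshape(-1, vocab_size),
    targets.reshape(-1),
)
print(loss.item())
```

Per the abstract, once the majority of the data has been processed this way, the same backbone switches to ordinary next-token training on the remaining data so that training aligns with the token-level inference mode.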

Cite

Text

Shao et al. "Beyond Next Token Prediction: Patch-Level Training for Large Language Models." International Conference on Learning Representations, 2025.

Markdown

[Shao et al. "Beyond Next Token Prediction: Patch-Level Training for Large Language Models." International Conference on Learning Representations, 2025.](https://mlanthology.org/iclr/2025/shao2025iclr-beyond/)

BibTeX

@inproceedings{shao2025iclr-beyond,
  title     = {{Beyond Next Token Prediction: Patch-Level Training for Large Language Models}},
  author    = {Shao, Chenze and Meng, Fandong and Zhou, Jie},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
  url       = {https://mlanthology.org/iclr/2025/shao2025iclr-beyond/}
}