The Pitfalls of Next-Token Prediction

Abstract

Can a mere next-token predictor faithfully model human intelligence? Our work aims to crystallize this intuitive concern, which is currently fragmented in the literature. As a starting point, we advocate isolating the two phases of next-token prediction that are often conflated: autoregression during inference and teacher-forcing during training. We argue that the previously identified problem of "exponential error accumulation" is a symptom of autoregressive inference. We then identify a more concerning problem: teacher-forcing can let the model fit the training data by _cheating_, causing total in-distribution failure during inference. We design a minimal planning task on which both the Transformer and the Mamba architectures empirically fail in this manner, remarkably, despite the task being easy to learn. Our work consolidates these and other essential arguments surrounding next-token prediction. We hope our effort can ground the next-token prediction debate and inspire further explorations beyond this paradigm.
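
To make the distinction between the two phases concrete, here is a minimal sketch in PyTorch. It is not the authors' code: `model` stands for any autoregressive network mapping a token prefix to next-token logits, and the function names are illustrative assumptions.

```python
# Sketch contrasting the two phases of next-token prediction.
# Assumes a hypothetical `model(tokens) -> logits` interface; not the paper's code.

import torch
import torch.nn.functional as F


def teacher_forced_loss(model, tokens):
    """Training phase: every position is conditioned on the GROUND-TRUTH prefix.

    tokens: LongTensor of shape (batch, seq_len).
    The model reads tokens[:, :-1] and is scored against tokens[:, 1:], so a
    mistake at one position never contaminates the inputs of later positions.
    """
    logits = model(tokens[:, :-1])                      # (batch, seq_len - 1, vocab)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        tokens[:, 1:].reshape(-1),
    )


@torch.no_grad()
def autoregressive_generate(model, prefix, num_new_tokens):
    """Inference phase: every position is conditioned on the model's OWN outputs.

    prefix: LongTensor of shape (batch, prefix_len).
    Each greedy prediction is appended and fed back in, so a single early error
    can propagate through the rest of the generated sequence.
    """
    tokens = prefix
    for _ in range(num_new_tokens):
        logits = model(tokens)                          # (batch, cur_len, vocab)
        next_token = logits[:, -1].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens
```

The "exponential error accumulation" criticism concerns only the second function, where predictions are fed back as inputs; the cheating failure mode discussed in the paper arises already in the first, during teacher-forced training.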

Cite

Text

Bachmann and Nagarajan. "The Pitfalls of Next-Token Prediction." ICLR 2024 Workshops: AGI, 2024.

Markdown

[Bachmann and Nagarajan. "The Pitfalls of Next-Token Prediction." ICLR 2024 Workshops: AGI, 2024.](https://mlanthology.org/iclrw/2024/bachmann2024iclrw-pitfalls/)

BibTeX

@inproceedings{bachmann2024iclrw-pitfalls,
  title     = {{The Pitfalls of Next-Token Prediction}},
  author    = {Bachmann, Gregor and Nagarajan, Vaishnavh},
  booktitle = {ICLR 2024 Workshops: AGI},
  year      = {2024},
  url       = {https://mlanthology.org/iclrw/2024/bachmann2024iclrw-pitfalls/}
}