Encoder-Only Next Token Prediction

Abstract

Next-token prediction is conventionally done using decoder-only Transformers with causal attention, as this approach allows for efficient reuse of keys and values. But if we were not compute-limited, should we still use decoder-only Transformers? In this work, we introduce Encoder-only Next Token Prediction (ENTP). We explore the differences between ENTP and decoder-only Transformers in expressive power and complexity, highlighting potential advantages of ENTP in settings with unbounded compute. We introduce the $\operatorname{Count3}$ task and show, both theoretically and experimentally, that while ENTP can perform this task easily, a decoder-only Transformer cannot. Finally, we empirically demonstrate the superior performance of ENTP across representative tasks where next-token-prediction-based Transformers can be evaluated, including addition, in-context learning, and language modeling.
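
The architectural contrast described in the abstract can be illustrated with a minimal sketch (not the authors' code): a decoder-only model applies a causal mask so each position attends only to earlier tokens and keys/values can be cached, while an ENTP-style model re-encodes the entire prefix with bidirectional attention at every prediction step. The TinyLM model, its hyperparameters, and the choice to read the prediction from the final position are assumptions of this illustration.

# Minimal sketch (not the paper's implementation) contrasting causal,
# decoder-only next-token prediction with ENTP-style prediction, where the
# whole prefix is re-encoded with bidirectional attention at every step.
import torch
import torch.nn as nn

class TinyLM(nn.Module):  # hypothetical toy model for illustration
    def __init__(self, vocab_size=64, d_model=128, nhead=4, num_layers=2, max_len=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, causal):
        # tokens: (batch, seq_len) of token ids
        positions = torch.arange(tokens.size(1), device=tokens.device)
        x = self.embed(tokens) + self.pos(positions)
        mask = None
        if causal:
            # Decoder-only: position i may attend only to positions <= i.
            mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        return self.head(self.encoder(x, mask=mask))  # (batch, seq_len, vocab)

@torch.no_grad()
def generate(model, prefix, steps, causal):
    tokens = prefix.clone()
    for _ in range(steps):
        # ENTP (causal=False) must rerun the encoder on the full prefix at each
        # step, since without causal structure keys/values cannot be cached.
        logits = model(tokens, causal=causal)
        next_tok = logits[:, -1].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_tok], dim=1)
    return tokens

model = TinyLM()
prefix = torch.randint(0, 64, (1, 8))
print(generate(model, prefix, steps=4, causal=True))   # decoder-only style
print(generate(model, prefix, steps=4, causal=False))  # ENTP style

The same asymmetry applies during training: a causal model computes all next-token losses in one forward pass, whereas an ENTP-style model must encode each prefix separately, which is why the abstract frames ENTP as attractive mainly when compute is not the binding constraint.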

Cite

Text

Ewer et al. "Encoder-Only Next Token Prediction." Transactions on Machine Learning Research, 2025.

Markdown

[Ewer et al. "Encoder-Only Next Token Prediction." Transactions on Machine Learning Research, 2025.](https://mlanthology.org/tmlr/2025/ewer2025tmlr-encoderonly/)

BibTeX

@article{ewer2025tmlr-encoderonly,
  title     = {{Encoder-Only Next Token Prediction}},
  author    = {Ewer, Ethan and Chae, Daewon and Zeng, Thomas and Kim, Jinkyu and Lee, Kangwook},
  journal   = {Transactions on Machine Learning Research},
  year      = {2025},
  url       = {https://mlanthology.org/tmlr/2025/ewer2025tmlr-encoderonly/}
}