Multi-Token Prediction Needs Registers

Abstract

Multi-token prediction has emerged as a promising objective for improving language model pretraining, but its benefits have not consistently generalized to other settings such as fine-tuning. In this paper, we propose MuToR, a simple and effective approach to multi-token prediction that interleaves learnable register tokens into the input sequence, each tasked with predicting future targets. Compared to existing methods, MuToR offers several key advantages: it introduces only a negligible number of additional parameters, requires no architectural changes—ensuring compatibility with off-the-shelf pretrained language models—and remains aligned with the next-token pretraining objective, making it especially well-suited for supervised fine-tuning. Moreover, it naturally supports scalable prediction horizons. We demonstrate the effectiveness and versatility of MuToR across a range of use cases, including supervised fine-tuning, parameter-efficient fine-tuning (PEFT), and pretraining, on challenging generative tasks in both language and vision domains.
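To make the core idea concrete, the sketch below illustrates one plausible way to interleave a single learnable register vector into a token-embedding sequence and to assign each register a future-token target. This is a minimal illustration in PyTorch, not the paper's reference implementation: the helper names (interleave_registers, register_targets), the stride-based placement rule, and the offset/stride hyperparameters are assumptions made for the example.

import torch
import torch.nn as nn

def interleave_registers(token_emb, register_emb, stride=4):
    # Insert one copy of the learnable register vector after every `stride`
    # token embeddings; return the interleaved sequence plus a boolean mask
    # marking which positions are registers.
    chunks, is_register = [], []
    for start in range(0, token_emb.shape[0], stride):
        chunk = token_emb[start:start + stride]
        chunks.append(chunk)
        is_register.extend([False] * chunk.shape[0])
        chunks.append(register_emb.unsqueeze(0))
        is_register.append(True)
    return torch.cat(chunks, dim=0), torch.tensor(is_register)

def register_targets(token_ids, stride=4, offset=2, ignore_index=-100):
    # For the register placed after each stride-sized chunk, use the token
    # `offset` steps beyond the chunk's last position as its (future) target;
    # registers with no valid future token are ignored in the loss.
    seq_len = token_ids.shape[0]
    targets = []
    for start in range(0, seq_len, stride):
        tgt_pos = min(start + stride, seq_len) - 1 + offset
        targets.append(token_ids[tgt_pos].item() if tgt_pos < seq_len else ignore_index)
    return torch.tensor(targets)

# Toy usage: 10 tokens, 8-dim embeddings, a single shared register vector
# (the only parameters added on top of the base model).
vocab_size, dim = 100, 8
embed = nn.Embedding(vocab_size, dim)
register = nn.Parameter(torch.randn(dim))
token_ids = torch.randint(0, vocab_size, (10,))
seq, reg_mask = interleave_registers(embed(token_ids), register, stride=4)
future_targets = register_targets(token_ids, stride=4, offset=2)
print(seq.shape, reg_mask.sum().item(), future_targets)

Because the register is a single embedding vector consumed through the ordinary input pathway, a sketch like this adds only a negligible number of parameters and leaves the transformer architecture itself unchanged, which is what makes the approach compatible with off-the-shelf pretrained models and with standard next-token fine-tuning.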

Cite

Text

Gerontopoulos et al. "Multi-Token Prediction Needs Registers." Advances in Neural Information Processing Systems, 2025.

Markdown

[Gerontopoulos et al. "Multi-Token Prediction Needs Registers." Advances in Neural Information Processing Systems, 2025.](https://mlanthology.org/neurips/2025/gerontopoulos2025neurips-multitoken/)

BibTeX

@inproceedings{gerontopoulos2025neurips-multitoken,
  title     = {{Multi-Token Prediction Needs Registers}},
  author    = {Gerontopoulos, Anastasios and Gidaris, Spyros and Komodakis, Nikos},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
  url       = {https://mlanthology.org/neurips/2025/gerontopoulos2025neurips-multitoken/}
}