Beyond Next-Token: Next-X Prediction for Autoregressive Visual Generation

Ren, Sucheng; Yu, Qihang; He, Ju; Shen, Xiaohui; Yuille, Alan; Chen, Liang-Chieh

ICCV 2025 pp. 15781-15791

Abstract

Autoregressive (AR) modeling, known for its next-token prediction paradigm, underpins state-of-the-art language and visual generative models. Traditionally, a "token" is treated as the smallest prediction unit, often a discrete symbol in language or a quantized patch in vision. However, the optimal token definition for 2D image structures remains an open question. Moreover, AR models suffer from exposure bias, where teacher forcing during training leads to error accumulation at inference. In this paper, we propose xAR, a generalized AR framework that extends the notion of a token to an entity X, which can represent an individual patch token, a cell (a k×k grouping of neighboring patches), a subsample (a non-local grouping of distant patches), a scale (coarse-to-fine resolution), or even a whole image. Additionally, we reformulate discrete token classification as continuous entity regression, leveraging flow-matching methods at each AR step. This approach conditions training on noisy entities instead of ground-truth tokens, leading to Noisy Context Learning, which effectively alleviates exposure bias. As a result, xAR offers two key advantages: (1) it enables flexible prediction units that capture different contextual granularity and spatial structures, and (2) it mitigates exposure bias by avoiding reliance on teacher forcing. On the ImageNet-256 generation benchmark, our base model, xAR-B, outperforms DiT-XL/SiT-XL while achieving 20× faster inference. Meanwhile, xAR-H sets a new state-of-the-art with an FID of 1.24, running 2.2× faster than the previous best-performing model without relying on vision foundation modules (e.g., DINOv2) or advanced guidance interval sampling. Code is publicly available at https://oliverrensu.github.io/project/xAR.
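
To make the difference between a "cell" and a "subsample" concrete, the following is a minimal sketch, not the authors' implementation. It assumes patches are indexed in row-major order on a small grid; the function names `cell_groups` and `subsample_groups` are hypothetical, and the abstract does not specify how xAR orders or sizes its entities.

```python
from typing import List


def cell_groups(height: int, width: int, k: int) -> List[List[int]]:
    """Group a height x width patch grid into k x k cells of neighboring patches.

    Patches are identified by row-major indices (an assumption for illustration).
    """
    if height % k or width % k:
        raise ValueError("grid size must be divisible by k")
    groups = []
    for top in range(0, height, k):
        for left in range(0, width, k):
            groups.append([
                (top + dr) * width + (left + dc)
                for dr in range(k)
                for dc in range(k)
            ])
    return groups


def subsample_groups(height: int, width: int, k: int) -> List[List[int]]:
    """Group distant patches that share the same offset inside every k x k block.

    Each group is a strided (non-local) subsampling of the grid.
    """
    if height % k or width % k:
        raise ValueError("grid size must be divisible by k")
    groups = []
    for dr in range(k):
        for dc in range(k):
            groups.append([
                r * width + c
                for r in range(dr, height, k)
                for c in range(dc, width, k)
            ])
    return groups


cells = cell_groups(4, 4, 2)
subsamples = subsample_groups(4, 4, 2)
print(cells[0])       # [0, 1, 4, 5]   (adjacent patches)
print(subsamples[0])  # [0, 2, 8, 10]  (patches spread across the grid)
```

Both groupings partition the same 16 patches into four entities of four patches each; they differ only in whether an entity is spatially local or spread across the image.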

Cite

Text

Ren et al. "Beyond Next-Token: Next-X Prediction for Autoregressive Visual Generation." International Conference on Computer Vision, 2025.

Markdown

[Ren et al. "Beyond Next-Token: Next-X Prediction for Autoregressive Visual Generation." International Conference on Computer Vision, 2025.](https://mlanthology.org/iccv/2025/ren2025iccv-beyond/)

BibTeX

@inproceedings{ren2025iccv-beyond,
  title     = {{Beyond Next-Token: Next-X Prediction for Autoregressive Visual Generation}},
  author    = {Ren, Sucheng and Yu, Qihang and He, Ju and Shen, Xiaohui and Yuille, Alan and Chen, Liang-Chieh},
  booktitle = {International Conference on Computer Vision},
  year      = {2025},
  pages     = {15781-15791},
  url       = {https://mlanthology.org/iccv/2025/ren2025iccv-beyond/}
}