BAR: Refactor the Basis of Autoregressive Visual Generation

Abstract

Autoregressive (AR) models, despite their remarkable successes, encounter limitations in image generation because they predict tokens, e.g., local image patches, sequentially in a predetermined row-major raster-scan order. Prior works improve AR with various designs of prediction units and orders but still rely on human inductive biases. This work proposes Basis Autoregressive (BAR), a novel paradigm that conceptualizes tokens as basis vectors within the image space and employs an end-to-end learnable approach to transform the basis. By viewing each token $x_k$ as the projection of the image $\mathbf{x}$ onto a basis vector $e_k$, BAR's unified framework refactors fixed token sequences through the linear transform $\mathbf{y}=\mathbf{Ax}$ and encompasses previous methods as specific instances of the matrix $\mathbf{A}$. Furthermore, BAR adaptively optimizes the transform matrix with an end-to-end AR objective, thereby discovering effective strategies beyond hand-crafted assumptions. Comprehensive experiments, notably a state-of-the-art FID of 1.15 on the ImageNet-256 benchmark, demonstrate the ability of BAR to overcome human biases and significantly advance image generation, including text-to-image synthesis.
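
As a rough illustration of the linear transform $\mathbf{y}=\mathbf{Ax}$ described in the abstract, the sketch below applies three hand-picked matrices to a toy token sequence: the identity (raster order unchanged), a permutation (a different prediction order), and a dense matrix that mixes tokens. Everything here is an assumption made for illustration: the helper names (`matvec`, `identity`, `permutation`), the scalar tokens, and the specific matrices are hypothetical, and none of them is the learned matrix or the implementation from the paper.

```python
# Hypothetical illustration only: names, tokens, and matrices are assumptions,
# not taken from the paper.

def matvec(A, x):
    """Compute y = A x for a list-of-rows matrix A and a vector x."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def identity(n):
    """A = I: the token sequence is left in its original raster order."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def permutation(order):
    """Row k selects token order[k], so y[k] = x[order[k]]."""
    n = len(order)
    return [[1.0 if j == order[k] else 0.0 for j in range(n)] for k in range(n)]

# A 2x2 grid of scalar "tokens", flattened in row-major raster order.
x = [1.0, 2.0, 3.0, 4.0]

# Instance 1: identity keeps the raster-scan order.
raster = matvec(identity(4), x)

# Instance 2: a permutation matrix yields a column-major order instead.
column_major = matvec(permutation([0, 2, 1, 3]), x)

# Instance 3: a dense (hand-picked, not learned) matrix mixes tokens:
# one global average followed by three difference components.
mixing = [
    [0.25, 0.25, 0.25, 0.25],
    [1.0, -1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, -1.0],
    [1.0, 1.0, -1.0, -1.0],
]
mixed = matvec(mixing, x)

print(raster)        # [1.0, 2.0, 3.0, 4.0]
print(column_major)  # [1.0, 3.0, 2.0, 4.0]
print(mixed)         # [2.5, -1.0, -1.0, -4.0]
```

In this toy setting, changing the prediction order corresponds to choosing a permutation for $\mathbf{A}$, while a dense $\mathbf{A}$ changes what each prediction unit represents. According to the abstract, BAR optimizes the matrix with the AR objective rather than fixing it by hand as done here.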

Cite

Text

Tang et al. "BAR: Refactor the Basis of Autoregressive Visual Generation." International Conference on Learning Representations, 2026.

Markdown

[Tang et al. "BAR: Refactor the Basis of Autoregressive Visual Generation." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/tang2026iclr-bar/)

BibTeX

@inproceedings{tang2026iclr-bar,
  title     = {{BAR: Refactor the Basis of Autoregressive Visual Generation}},
  author    = {Tang, Zhicong and Chen, Dong and Bao, Jianmin and Guo, Baining},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/tang2026iclr-bar/}
}