Hyperspherical Latents Improve Continuous-Token Autoregressive Generation

Abstract

Autoregressive (AR) models are promising for image generation, yet continuous-token AR variants often trail latent diffusion and masked-generation models. The core issue is heterogeneous variance in VAE latents, which is amplified during AR decoding, especially under classifier-free guidance (CFG), and can cause variance collapse. We propose SphereAR to address this issue. Its core design is to constrain all AR inputs and outputs, including those produced after CFG, to lie on a fixed-radius hypersphere (constant $\ell_2$ norm), leveraging hyperspherical VAEs. Our theoretical analysis shows that the hyperspherical constraint removes the scale component (the primary cause of variance collapse), thereby stabilizing AR decoding. Empirically, on ImageNet generation, SphereAR-H (943M) sets a new state of the art for AR models, achieving FID 1.34. Even at smaller scales, SphereAR-L (479M) reaches FID 1.54 and SphereAR-B (208M) reaches FID 1.92, matching or surpassing much larger baselines such as MAR-H (943M, FID 1.55) and VAR-d30 (2B, FID 1.92). To our knowledge, this is the first time a pure next-token AR image generator with raster order surpasses diffusion and masked-generation models at comparable parameter scales.
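
The fixed-radius constraint after CFG can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the default radius of 1.0, and the standard CFG extrapolation formula used here are assumptions made for illustration only. The sketch shows that CFG can push a token off the sphere, and that rescaling it back to a constant $\ell_2$ norm discards the scale component while keeping the direction.

```python
import math

def l2_norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def project_to_sphere(v, radius=1.0, eps=1e-12):
    """Rescale v so that its l2 norm equals `radius` (hypothetical helper)."""
    scale = radius / max(l2_norm(v), eps)  # eps guards against a zero vector
    return [x * scale for x in v]

def cfg_then_project(cond, uncond, guidance_scale, radius=1.0):
    """Apply the usual CFG extrapolation, then re-project onto the sphere.

    Assumption: guided = uncond + s * (cond - uncond), the common CFG form.
    """
    guided = [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]
    return project_to_sphere(guided, radius)

# Two tokens that already lie on a sphere of radius 2.
cond, uncond = [2.0, 0.0], [0.0, 2.0]
token = cfg_then_project(cond, uncond, guidance_scale=3.0, radius=2.0)
print(l2_norm(token))
```

In this toy case the unprojected CFG output is `[6.0, -4.0]`, whose norm is well above the radius of 2; the projection keeps its direction and restores the constant norm before the token would be fed back as the next AR input.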

Cite

Text

Ke and Xue. "Hyperspherical Latents Improve Continuous-Token Autoregressive Generation." International Conference on Learning Representations, 2026.

Markdown

[Ke and Xue. "Hyperspherical Latents Improve Continuous-Token Autoregressive Generation." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/ke2026iclr-hyperspherical/)

BibTeX

@inproceedings{ke2026iclr-hyperspherical,
  title     = {{Hyperspherical Latents Improve Continuous-Token Autoregressive Generation}},
  author    = {Ke, Guolin and Xue, Hui},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/ke2026iclr-hyperspherical/}
}