Shift-and-Sum Quantization for Visual Autoregressive Models

Abstract

Post-training quantization (PTQ) enables efficient deployment of deep networks using a small set of data. Its application to visual autoregressive models (VAR), however, remains relatively unexplored. We identify two key challenges for applying PTQ to VAR: (i) large reconstruction errors in attention–value products, especially at coarse scales where high attention scores occur more frequently; and (ii) a discrepancy between the sampling frequencies of codebook entries and their predicted probabilities due to limited calibration data. To address these challenges, we propose a PTQ framework tailored for VAR. First, we introduce a shift-and-sum quantization method that reduces reconstruction errors by aggregating quantized results from symmetrically shifted duplicates of value tokens. Second, we present a resampling strategy for calibration data that aligns sampling frequencies of codebook entries with their predicted probabilities. Experiments on class-conditional image generation, in-painting, out-painting, and class-conditional editing show consistent improvements across VAR architectures, establishing a new state of the art in PTQ for VAR.
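
The intuition behind aggregating quantized results from symmetrically shifted duplicates can be illustrated with a toy scalar example. This is a minimal sketch under stated assumptions, not the authors' implementation: the uniform round-to-nearest quantizer, the shift magnitude of a quarter step, the two-copy average, and the names `quantize`, `shift_and_sum`, and `max_abs_error` are all hypothetical choices made for illustration, and clipping and the attention weighting are ignored.

```python
import math
import random


def quantize(x, step):
    """Uniform round-to-nearest quantizer with step size `step` (no clipping)."""
    return step * math.floor(x / step + 0.5)


def shift_and_sum(x, step):
    """Average the quantized values of two symmetrically shifted copies of x.

    The shifts +step/4 and -step/4 are an assumption for this sketch; they
    cancel in the average, so no explicit un-shifting is needed.
    """
    shift = step / 4.0
    return 0.5 * (quantize(x + shift, step) + quantize(x - shift, step))


def max_abs_error(fn, samples, step):
    """Largest absolute reconstruction error of `fn` over the samples."""
    return max(abs(fn(x, step) - x) for x in samples)


random.seed(0)
step = 0.5
samples = [random.uniform(-2.0, 2.0) for _ in range(10_000)]

plain_err = max_abs_error(quantize, samples, step)
shifted_err = max_abs_error(shift_and_sum, samples, step)

print(f"plain quantization, max error: {plain_err:.4f}")
print(f"shift-and-sum,      max error: {shifted_err:.4f}")
```

In this toy setting the two shifted copies round to different grid points whenever the input lies near the midpoint between them, so their average falls on a grid twice as fine as the original one and the worst-case error is bounded by a quarter step instead of a half step. How the shifts are chosen and how the results are combined inside the attention-value product in the paper may differ from this sketch.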

Cite

Text

Moon and Ham. "Shift-and-Sum Quantization for Visual Autoregressive Models." International Conference on Learning Representations, 2026.

Markdown

[Moon and Ham. "Shift-and-Sum Quantization for Visual Autoregressive Models." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/moon2026iclr-shiftandsum/)

BibTeX

@inproceedings{moon2026iclr-shiftandsum,
  title     = {{Shift-and-Sum Quantization for Visual Autoregressive Models}},
  author    = {Moon, Jaehyeon and Ham, Bumsub},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/moon2026iclr-shiftandsum/}
}