Shift-and-Sum Quantization for Visual Autoregressive Models
Abstract
Post-training quantization (PTQ) enables efficient deployment of deep networks using a small set of data. Its application to visual autoregressive models (VAR), however, remains relatively unexplored. We identify two key challenges for applying PTQ to VAR: (i) large reconstruction errors in attention–value products, especially at coarse scales where high attention scores occur more frequently; and (ii) a discrepancy between the sampling frequencies of codebook entries and their predicted probabilities due to limited calibration data. To address these challenges, we propose a PTQ framework tailored for VAR. First, we introduce a shift-and-sum quantization method that reduces reconstruction errors by aggregating quantized results from symmetrically shifted duplicates of value tokens. Second, we present a resampling strategy for calibration data that aligns sampling frequencies of codebook entries with their predicted probabilities. Experiments on class-conditional image generation, in-painting, out-painting, and class-conditional editing show consistent improvements across VAR architectures, establishing a new state of the art in PTQ for VAR.
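The abstract does not spell out the shift-and-sum formulation, so the following is only a toy sketch of the general idea it names: quantize two symmetrically shifted copies of a value, undo the shifts, and aggregate the results. The uniform round-to-nearest quantizer, the scalar setting, the averaging step, and the choice of a quarter-step shift are all assumptions made for illustration; they are not taken from the paper, which applies the idea to value tokens inside attention.

```python
import random


def quantize(x, step):
    """Uniform round-to-nearest quantizer (an assumed, generic quantizer)."""
    return step * round(x / step)


def shift_and_sum(x, step, shift):
    """Average the quantized results of two symmetrically shifted copies of x.

    Hypothetical helper: the shift size and the averaging are illustrative
    choices, not the paper's formulation.
    """
    plus = quantize(x + shift, step) - shift   # quantize x + shift, then undo the shift
    minus = quantize(x - shift, step) + shift  # quantize x - shift, then undo the shift
    return 0.5 * (plus + minus)


random.seed(0)
step = 0.5
values = [random.uniform(-2.0, 2.0) for _ in range(10_000)]

# Mean squared reconstruction error of plain quantization vs. the shifted average.
plain_err = sum((v - quantize(v, step)) ** 2 for v in values) / len(values)
shifted_err = sum(
    (v - shift_and_sum(v, step, step / 4)) ** 2 for v in values
) / len(values)

print(f"plain MSE:         {plain_err:.5f}")
print(f"shift-and-sum MSE: {shifted_err:.5f}")
```

In this toy setting, the two shifted copies round at different thresholds, so their average lands on a grid twice as fine as the original one while each copy still uses the original step size. The trade-off is that two quantized copies must be processed instead of one.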
Cite
Text
Moon and Ham. "Shift-and-Sum Quantization for Visual Autoregressive Models." International Conference on Learning Representations, 2026.
Markdown
[Moon and Ham. "Shift-and-Sum Quantization for Visual Autoregressive Models." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/moon2026iclr-shiftandsum/)
BibTeX
@inproceedings{moon2026iclr-shiftandsum,
title = {{Shift-and-Sum Quantization for Visual Autoregressive Models}},
author = {Moon, Jaehyeon and Ham, Bumsub},
booktitle = {International Conference on Learning Representations},
year = {2026},
url = {https://mlanthology.org/iclr/2026/moon2026iclr-shiftandsum/}
}