LiteVAR: Compressing Visual Autoregressive Modelling with Efficient Attention and Quantization

Abstract

Visual Autoregressive (VAR) modelling has emerged as a promising approach to image generation, offering competitive potential and performance comparable to diffusion-based models. However, current AR-based visual generation models require substantial computational resources, limiting their applicability on resource-constrained devices. To address this issue, we conduct an analysis and identify significant redundancy along three dimensions of the VAR model: (1) the attention map, (2) the attention outputs when using classifier-free guidance, and (3) the data precision. Correspondingly, we propose an efficient attention mechanism and a low-bit quantization method to enhance the efficiency of VAR models while maintaining performance. With negligible performance loss (less than 0.056 FID increase), we achieve an 85.2% reduction in attention computation, a 50% reduction in overall memory, and a 1.5x latency reduction. To ensure deployment feasibility, we develop efficient training-free compression techniques and analyze the deployment feasibility and efficiency gain of each technique.
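To illustrate one of the ideas the abstract names, the sketch below shows a minimal symmetric per-tensor int8 quantizer in NumPy. This is a generic low-bit quantization baseline, not the paper's exact scheme; the function names and the per-tensor granularity are assumptions for illustration only.

```python
import numpy as np

def quantize_int8(x):
    # Symmetric per-tensor quantization to int8 (illustrative sketch,
    # not LiteVAR's exact method). One float scale maps the tensor's
    # max magnitude onto the int8 range [-127, 127].
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover an approximate float tensor from the int8 codes.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# Rounding error is bounded by half a quantization step.
err = np.abs(w - w_hat).max()
```

Storing weights and activations as int8 rather than float32 is what yields memory reductions on the order of the 50% figure reported above (int8 weights plus per-tensor scales, compared against a float16 baseline, would halve memory; against float32 the saving is larger).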

Cite

Text

Xie et al. "LiteVAR: Compressing Visual Autoregressive Modelling with Efficient Attention and Quantization." NeurIPS 2024 Workshops: Compression, 2024.

Markdown

[Xie et al. "LiteVAR: Compressing Visual Autoregressive Modelling with Efficient Attention and Quantization." NeurIPS 2024 Workshops: Compression, 2024.](https://mlanthology.org/neuripsw/2024/xie2024neuripsw-litevar/)

BibTeX

@inproceedings{xie2024neuripsw-litevar,
  title     = {{LiteVAR: Compressing Visual Autoregressive Modelling with Efficient Attention and Quantization}},
  author    = {Xie, Rui and Zhao, Tianchen and Yuan, Zhihang and Wan, Rui and Gao, Wenxi and Zhu, Zhenhua and Ning, Xuefei and Wang, Yu},
  booktitle = {NeurIPS 2024 Workshops: Compression},
  year      = {2024},
  url       = {https://mlanthology.org/neuripsw/2024/xie2024neuripsw-litevar/}
}