ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers
Abstract
How to efficiently serve ever-larger trained natural language models in practice has become exceptionally challenging even for powerful cloud servers due to their prohibitive memory/computation requirements. In this work, we present an efficient and affordable post-training quantization approach to compress large Transformer-based models, termed ZeroQuant. ZeroQuant is an end-to-end quantization and inference pipeline with three main components: (1) a fine-grained, hardware-friendly quantization scheme for both weights and activations; (2) a novel, affordable layer-by-layer knowledge distillation algorithm (LKD) that works even without access to the original training data; and (3) highly optimized quantization system backend support that removes the quantization/dequantization overhead. As such, we are able to show that: (1) ZeroQuant can reduce the precision of weights and activations to INT8 in a cost-free way for both BERT- and GPT-3-style models with minimal accuracy impact, leading to up to 5.19x/4.16x speedup on BERT/GPT-3-style models, respectively, compared to FP16 inference; (2) ZeroQuant plus LKD can affordably quantize the weights in the fully-connected module to INT4, along with INT8 weights in the attention module and INT8 activations, resulting in a 3x memory footprint reduction compared to the FP16 model; and (3) ZeroQuant can be directly applied to two of the largest open-sourced language models, including GPT-NeoX, for which our INT8 model achieves similar accuracy to the FP16 model while being 5.2x more efficient. Our code is open-sourced at \cite{code_compression}.
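For concreteness, the fine-grained scheme in component (1) pairs group-wise quantization for weights with token-wise dynamic quantization for activations. The sketch below illustrates that general idea in PyTorch; the function names, group count, epsilon guard, and symmetric-rounding details are illustrative assumptions, not the paper's actual kernels or system backend.

import torch

def quantize_weight_groupwise(w: torch.Tensor, num_groups: int = 4, num_bits: int = 8):
    """Symmetric group-wise weight quantization (illustrative sketch).

    Each output row is split into `num_groups` groups, and every group gets
    its own scale, so an outlier in one group does not inflate the
    quantization error of the others.
    """
    qmax = 2 ** (num_bits - 1) - 1                                 # 127 for INT8
    out_features, in_features = w.shape
    w_grouped = w.reshape(out_features, num_groups, in_features // num_groups)
    scales = w_grouped.abs().amax(dim=-1, keepdim=True) / qmax     # one scale per group
    scales = scales.clamp(min=1e-8)                                # avoid division by zero
    w_int = torch.clamp(torch.round(w_grouped / scales), -qmax - 1, qmax)
    return w_int.to(torch.int8), scales

def quantize_activation_tokenwise(x: torch.Tensor, num_bits: int = 8):
    """Symmetric token-wise activation quantization (illustrative sketch).

    Scales are computed per token (per row) at runtime, so each token's
    dynamic range is captured individually.
    """
    qmax = 2 ** (num_bits - 1) - 1
    scales = x.abs().amax(dim=-1, keepdim=True) / qmax             # one scale per token
    scales = scales.clamp(min=1e-8)
    x_int = torch.clamp(torch.round(x / scales), -qmax - 1, qmax)
    return x_int.to(torch.int8), scales

# Example: quantize a toy weight matrix and a batch of token activations.
w = torch.randn(8, 16)
x = torch.randn(4, 16)
w_int8, w_scales = quantize_weight_groupwise(w, num_groups=4)
x_int8, x_scales = quantize_activation_tokenwise(x)
# Dequantization approximates the originals, e.g. (w_int8.float() * w_scales).reshape(8, 16) ≈ w.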
Cite
Text
Yao et al. "ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers." Neural Information Processing Systems, 2022.

Markdown
[Yao et al. "ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers." Neural Information Processing Systems, 2022.](https://mlanthology.org/neurips/2022/yao2022neurips-zeroquant/)

BibTeX
@inproceedings{yao2022neurips-zeroquant,
  title = {{ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers}},
  author = {Yao, Zhewei and Aminabadi, Reza Yazdani and Zhang, Minjia and Wu, Xiaoxia and Li, Conglong and He, Yuxiong},
  booktitle = {Neural Information Processing Systems},
  year = {2022},
  url = {https://mlanthology.org/neurips/2022/yao2022neurips-zeroquant/}
}