Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT

Abstract

Transformer-based architectures have become the de facto models for a range of Natural Language Processing tasks. In particular, BERT-based models achieved significant accuracy gains on GLUE tasks, CoNLL-03, and SQuAD. However, BERT-based models have a prohibitive memory footprint and latency. As a result, deploying BERT-based models in resource-constrained environments has become a challenging task. In this work, we perform an extensive analysis of fine-tuned BERT models using second-order Hessian information, and we use our results to propose a novel method for quantizing BERT models to ultra-low precision. In particular, we propose a new group-wise quantization scheme, and we use a Hessian-based mixed-precision method to compress the model further. We extensively test our proposed method on the BERT downstream tasks of SST-2, MNLI, CoNLL-03, and SQuAD. We achieve performance comparable to the baseline with at most 2.3% degradation, even with ultra-low precision quantization down to 2 bits, corresponding to up to 13× compression of the model parameters and up to 4× compression of the embedding table as well as activations. Among all tasks, we observed the highest performance loss for BERT fine-tuned on SQuAD. Through Hessian-based analysis and visualization, we show that this is related to the fact that the current training/fine-tuning strategy of BERT does not converge for SQuAD.
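To make the group-wise scheme mentioned in the abstract concrete, here is a minimal sketch of group-wise symmetric uniform quantization: a weight matrix is split into groups of rows, and each group gets its own quantization range. This is an illustrative simplification, not the paper's exact implementation; the function name, group size, and the use of simple min-max range estimation are assumptions for the sketch.

```python
import numpy as np

def groupwise_quantize(W, num_bits=2, group_size=6):
    """Illustrative group-wise symmetric uniform quantization.

    Each block of `group_size` rows of W is quantized with its own
    scale, so a large outlier in one group does not blow up the
    quantization range of the others.
    """
    q_max = 2 ** (num_bits - 1) - 1  # e.g. 1 for 2-bit symmetric levels
    W_q = np.empty_like(W, dtype=np.float64)
    for start in range(0, W.shape[0], group_size):
        group = W[start:start + group_size]
        # per-group scale from the largest magnitude in the group
        scale = np.abs(group).max() / max(q_max, 1) or 1.0
        # round to the nearest integer level, clip, then dequantize
        q = np.clip(np.round(group / scale), -q_max - 1, q_max)
        W_q[start:start + group_size] = q * scale
    return W_q
```

With `num_bits=2`, every value in a group is mapped onto at most four representable levels; finer groups shrink each quantization range and thus the rounding error, at the cost of storing more scales.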

Cite

Text

Shen et al. "Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT." AAAI Conference on Artificial Intelligence, 2020. doi:10.1609/AAAI.V34I05.6409

Markdown

[Shen et al. "Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT." AAAI Conference on Artificial Intelligence, 2020.](https://mlanthology.org/aaai/2020/shen2020aaai-q/) doi:10.1609/AAAI.V34I05.6409

BibTeX

@inproceedings{shen2020aaai-q,
  title     = {{Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT}},
  author    = {Shen, Sheng and Dong, Zhen and Ye, Jiayu and Ma, Linjian and Yao, Zhewei and Gholami, Amir and Mahoney, Michael W. and Keutzer, Kurt},
  booktitle = {AAAI Conference on Artificial Intelligence},
  year      = {2020},
  pages     = {8815--8821},
  doi       = {10.1609/AAAI.V34I05.6409},
  url       = {https://mlanthology.org/aaai/2020/shen2020aaai-q/}
}