XTC: Extreme Compression for Pre-Trained Transformers Made Simple and Efficient

Abstract

Extreme compression, particularly ultra-low bit precision (binary/ternary) quantization, has been proposed to fit large NLP models on resource-constrained devices. However, to preserve accuracy under such aggressive compression schemes, cutting-edge methods usually introduce complicated compression pipelines, e.g., multi-stage expensive knowledge distillation with extensive hyperparameter tuning. They also often pay less attention to smaller transformer models that have already been heavily compressed via knowledge distillation, and they lack a systematic study showing the effectiveness of their methods. In this paper, we perform a comprehensive systematic study to measure the impact of many key hyperparameters and training strategies from previous works. As a result, we find that previous baselines for ultra-low bit precision quantization are significantly under-trained. Based on our study, we propose a simple yet effective compression pipeline for extreme compression. Our simplified pipeline demonstrates that (1) we can skip the pre-training knowledge distillation to obtain a 5-layer BERT while achieving better performance than previous state-of-the-art methods, like TinyBERT; (2) extreme quantization plus layer reduction is able to reduce the model size by 50x, resulting in new state-of-the-art results on GLUE tasks.
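
To make "ternary quantization" concrete, the following is a minimal sketch of a generic threshold-based ternarization of a weight list. It is not the XTC pipeline, which the abstract does not specify. The function names `ternarize` and `dequantize` and the `0.7` threshold factor are assumptions chosen for illustration.

```python
def ternarize(weights, threshold_factor=0.7):
    """Map each weight to a code in {-1, 0, +1} plus one shared scale alpha.

    Generic illustrative scheme (an assumption, not the paper's method):
    weights whose magnitude is below threshold_factor * mean(|w|) become 0,
    the rest keep only their sign.
    """
    if not weights:
        return [], 0.0
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    threshold = threshold_factor * mean_abs
    # alpha is the mean magnitude of the weights that survive the threshold
    kept = [abs(w) for w in weights if abs(w) > threshold]
    alpha = sum(kept) / len(kept) if kept else 0.0
    codes = [1 if w > threshold else -1 if w < -threshold else 0 for w in weights]
    return codes, alpha


def dequantize(codes, alpha):
    """Reconstruct approximate weights from ternary codes and the scale."""
    return [alpha * c for c in codes]


codes, alpha = ternarize([0.9, -0.8, 0.05, -0.02, 0.4])
approx = dequantize(codes, alpha)
```

Each ternary code fits in 2 bits, compared with 32 bits for a float32 weight, which is where the storage saving of such schemes comes from. How quantization combines with layer reduction to reach the reported 50x is not detailed in the abstract.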

Cite

Text

Wu et al. "XTC: Extreme Compression for Pre-Trained Transformers Made Simple and Efficient." Neural Information Processing Systems, 2022.

Markdown

[Wu et al. "XTC: Extreme Compression for Pre-Trained Transformers Made Simple and Efficient." Neural Information Processing Systems, 2022.](https://mlanthology.org/neurips/2022/wu2022neurips-xtc/)

BibTeX

@inproceedings{wu2022neurips-xtc,
  title     = {{XTC: Extreme Compression for Pre-Trained Transformers Made Simple and Efficient}},
  author    = {Wu, Xiaoxia and Yao, Zhewei and Zhang, Minjia and Li, Conglong and He, Yuxiong},
  booktitle = {Neural Information Processing Systems},
  year      = {2022},
  url       = {https://mlanthology.org/neurips/2022/wu2022neurips-xtc/}
}