BiT: Robustly Binarized Multi-Distilled Transformer

Abstract

Modern pre-trained transformers have rapidly advanced the state of the art in machine learning, but have also grown in parameters and computational complexity, making them increasingly difficult to deploy in resource-constrained environments. Binarizing the weights and activations of the network can significantly alleviate these issues, but is technically challenging from an optimization perspective. In this work, we identify a series of improvements that enable binary transformers at a much higher accuracy than was previously possible. These include a two-set binarization scheme, a novel elastic binary activation function with learned parameters, and a method to quantize a network to its limit by successively distilling higher-precision models into lower-precision students. These approaches enable, for the first time, fully binarized transformer models at a practical level of accuracy, coming within as little as 5.9% of a full-precision BERT baseline on the GLUE language understanding benchmark. Code and models are available at: https://github.com/facebookresearch/bit.
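To illustrate one of the components mentioned above, the sketch below shows how an elastic binary activation with a learnable scale and threshold could be trained with a straight-through estimator in PyTorch. This is a minimal, hedged sketch of the general idea only: the module name `ElasticBinaryActivation`, the parameter initializations, and the exact rounding/clipping details are assumptions for illustration and may differ from the paper's formulation; see the linked repository for the authors' implementation.

```python
import torch
import torch.nn as nn


class ElasticBinaryActivation(nn.Module):
    """Illustrative elastic binary activation (sketch, not the paper's exact code).

    Binarizes inputs to {0, alpha} using a learnable scale (alpha) and
    threshold (beta), with gradients passed through a straight-through
    estimator (STE) so the parameters can be trained end to end.
    """

    def __init__(self):
        super().__init__()
        # Assumed initial values; the paper may initialize these differently.
        self.alpha = nn.Parameter(torch.tensor(1.0))  # learnable scale
        self.beta = nn.Parameter(torch.tensor(0.0))   # learnable threshold

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Shift by the threshold and normalize by the scale.
        x_scaled = (x - self.beta) / self.alpha
        # Hard binarization to {0, 1} via round-and-clip.
        hard = torch.clamp(torch.round(x_scaled), 0.0, 1.0)
        # Soft surrogate used only for the backward pass.
        soft = torch.clamp(x_scaled, 0.0, 1.0)
        # Straight-through estimator: forward uses the hard values,
        # backward flows gradients through the soft surrogate.
        binary = soft + (hard - soft).detach()
        return self.alpha * binary


if __name__ == "__main__":
    act = ElasticBinaryActivation()
    x = torch.randn(2, 4, requires_grad=True)
    y = act(x)
    y.sum().backward()  # gradients reach x, alpha, and beta via the STE
    print(y)
```

Because the hard rounding has zero gradient almost everywhere, the STE detour through the clipped surrogate is what lets the scale and threshold adapt during training, which is the "elastic" aspect the abstract refers to.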

Cite

Text

Liu et al. "BiT: Robustly Binarized Multi-Distilled Transformer." Neural Information Processing Systems, 2022.

Markdown

[Liu et al. "BiT: Robustly Binarized Multi-Distilled Transformer." Neural Information Processing Systems, 2022.](https://mlanthology.org/neurips/2022/liu2022neurips-bit/)

BibTeX

@inproceedings{liu2022neurips-bit,
  title     = {{BiT: Robustly Binarized Multi-Distilled Transformer}},
  author    = {Liu, Zechun and Oguz, Barlas and Pappu, Aasish and Xiao, Lin and Yih, Scott and Li, Meng and Krishnamoorthi, Raghuraman and Mehdad, Yashar},
  booktitle = {Neural Information Processing Systems},
  year      = {2022},
  url       = {https://mlanthology.org/neurips/2022/liu2022neurips-bit/}
}