Binarized Neural Machine Translation

Abstract

The rapid scaling of language models is motivating research using low-bitwidth quantization. In this work, we propose a novel binarization technique for Transformers applied to machine translation (BMT), the first of its kind. We identify and address the problem of inflated dot-product variance when using one-bit weights and activations. Specifically, BMT leverages additional LayerNorms and residual connections to improve binarization quality. Experiments on the WMT dataset show that a one-bit weight-only Transformer can achieve the same quality as a float one, while being 16$\times$ smaller in size. One-bit activations incur varying degrees of quality drop, which is mitigated by the proposed architectural changes. We further conduct a scaling law study using production-scale translation datasets, which shows that one-bit weight Transformers scale and generalize well in both in-domain and out-of-domain settings. Our implementation in JAX/Flax will be open-sourced.
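As a rough illustration of the one-bit weight idea described above (not the authors' released JAX/Flax code), the sketch below binarizes a weight matrix to {-1, +1} with a per-tensor scale and a straight-through estimator so gradients still flow during training. The scaling by the mean absolute value, the function names, and the dense-layer wrapper are assumptions made for illustration only.

import jax
import jax.numpy as jnp

def binarize_ste(w):
    """Binarize weights to {-1, +1}, rescaled by their mean magnitude.

    Forward pass uses sign(w) * mean(|w|); the backward pass routes gradients
    straight through to the real-valued weights (straight-through estimator).
    Note: this is an illustrative sketch, not the paper's exact scheme.
    """
    alpha = jnp.mean(jnp.abs(w))                  # per-tensor scaling factor (assumption)
    w_bin = jnp.sign(w) * alpha                   # one-bit weights, rescaled
    return w + jax.lax.stop_gradient(w_bin - w)   # forward: w_bin; gradient: identity w.r.t. w

def binary_dense(params, x):
    """A dense layer whose weights are binarized on the fly."""
    w_bin = binarize_ste(params["w"])
    return x @ w_bin + params["b"]

if __name__ == "__main__":
    key_w, key_x = jax.random.PRNGKey(0), jax.random.PRNGKey(1)
    params = {"w": jax.random.normal(key_w, (16, 8)), "b": jnp.zeros((8,))}
    x = jax.random.normal(key_x, (4, 16))
    print(binary_dense(params, x).shape)          # (4, 8)

In the paper's setting such a binarized projection would sit inside the Transformer's attention and feed-forward blocks, with the additional LayerNorms and residual connections mentioned in the abstract keeping the dot-product variance in check; those architectural details are not reproduced here.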

Cite

Text

Zhang et al. "Binarized Neural Machine Translation." Neural Information Processing Systems, 2023.

Markdown

[Zhang et al. "Binarized Neural Machine Translation." Neural Information Processing Systems, 2023.](https://mlanthology.org/neurips/2023/zhang2023neurips-binarized/)

BibTeX

@inproceedings{zhang2023neurips-binarized,
  title     = {{Binarized Neural Machine Translation}},
  author    = {Zhang, Yichi and Garg, Ankush and Cao, Yuan and Lew, Lukasz and Ghorbani, Behrooz and Zhang, Zhiru and Firat, Orhan},
  booktitle = {Neural Information Processing Systems},
  year      = {2023},
  url       = {https://mlanthology.org/neurips/2023/zhang2023neurips-binarized/}
}