Memory Efficient Optimizers with 4-Bit States

Abstract

Optimizer states are a major source of memory consumption when training neural networks, limiting the largest trainable model within a given memory budget. Compressing optimizer states from 32-bit floating point to lower bitwidths is a promising way to reduce the training memory footprint, but the lowest bitwidth achieved so far is 8 bits. In this work, we push the bitwidth of optimizer states down to 4 bits through a detailed empirical analysis of the first and second moments. Specifically, we find that the moments have complicated outlier patterns that current block-wise quantization cannot accurately approximate. We use a smaller block size and propose to utilize both row-wise and column-wise information for better quantization. We further identify a zero-point problem when quantizing the second moment, and solve it with a linear quantizer that excludes the zero point. Our 4-bit optimizers are evaluated on a wide variety of benchmarks, including natural language understanding, machine translation, image classification, and instruction tuning. On all tasks, our optimizers achieve accuracy comparable to their full-precision counterparts while enjoying better memory efficiency.
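The quantization ideas named in the abstract can be illustrated with a short sketch. The code below is a minimal NumPy illustration, not the authors' implementation: the helper names (`blockwise_quantize_4bit`, `rank1_scales`), the block size of 128, and the 15-level positive codebook are all assumptions chosen for clarity. It shows (a) per-block absmax quantization of a non-negative tensor to 4-bit indices, (b) a linear codebook over (0, 1] that deliberately excludes the zero point, so that small-but-nonzero second-moment entries are never collapsed to exactly zero, and (c) one plausible way to combine row-wise and column-wise statistics into a per-entry scale.

```python
import numpy as np

# 15 evenly spaced levels in (0, 1]: a linear codebook that excludes the
# zero point. Mapping a tiny second-moment entry to exactly zero would
# blow up Adam's 1/sqrt(v) update, hence the exclusion.
CODEBOOK = np.arange(1, 16) / 15.0

def blockwise_quantize_4bit(x, block_size=128):
    """Quantize a non-negative tensor (e.g. the second moment) to 4-bit
    indices with one absmax scale per block. A signed codebook would be
    needed for the first moment; this sketch covers the non-negative case."""
    pad = (-x.size) % block_size                        # pad to whole blocks
    blocks = np.pad(x.ravel(), (0, pad)).reshape(-1, block_size)
    scales = blocks.max(axis=1, keepdims=True) + 1e-12  # per-block max (x >= 0)
    normalized = blocks / scales                        # now in [0, 1]
    # Nearest-codeword lookup -> indices in 0..14 (fits in 4 bits).
    idx = np.abs(normalized[..., None] - CODEBOOK).argmin(axis=-1)
    return idx.astype(np.uint8), scales, x.shape, pad

def blockwise_dequantize_4bit(idx, scales, shape, pad):
    """Invert the mapping: look up codewords and rescale per block."""
    flat = (CODEBOOK[idx] * scales).ravel()
    return flat[: flat.size - pad].reshape(shape)

def rank1_scales(m):
    """One plausible way to use both row-wise and column-wise information:
    scale entry (i, j) by min(row_max[i], col_max[j]) rather than a single
    per-block max, tracking outliers along both axes more tightly."""
    row_max = np.abs(m).max(axis=1, keepdims=True)
    col_max = np.abs(m).max(axis=0, keepdims=True)
    return np.minimum(row_max, col_max)

# Round-trip a fake, skewed second-moment tensor and check the error.
v = np.random.rand(256, 256).astype(np.float32) ** 4
idx, scales, shape, pad = blockwise_quantize_4bit(v)
v_hat = blockwise_dequantize_4bit(idx, scales, shape, pad)
print("max abs error / max value:", np.abs(v - v_hat).max() / v.max())
```

In a real optimizer the 4-bit indices would be bit-packed two entries per byte to realize the memory savings; the sketch keeps one `uint8` per entry for readability.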

Cite

Text

Li et al. "Memory Efficient Optimizers with 4-Bit States." Neural Information Processing Systems, 2023.

Markdown

[Li et al. "Memory Efficient Optimizers with 4-Bit States." Neural Information Processing Systems, 2023.](https://mlanthology.org/neurips/2023/li2023neurips-memory/)

BibTeX

@inproceedings{li2023neurips-memory,
  title     = {{Memory Efficient Optimizers with 4-Bit States}},
  author    = {Li, Bingrui and Chen, Jianfei and Zhu, Jun},
  booktitle = {Neural Information Processing Systems},
  year      = {2023},
  url       = {https://mlanthology.org/neurips/2023/li2023neurips-memory/}
}