PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression

Abstract

There has been significant interest in "extreme" compression of large language models (LLMs), i.e. to 1-2 bits per parameter, which allows such models to be executed efficiently on resource-constrained devices. Existing work focused on improved one-shot quantization techniques and weight representations; yet, purely post-training approaches are reaching diminishing returns in terms of the accuracy-vs-bit-width trade-off. State-of-the-art quantization methods such as QuIP# and AQLM include fine-tuning (part of) the compressed parameters over a limited amount of calibration data; however, such fine-tuning techniques over compressed weights often make exclusive use of straight-through estimators (STE), whose performance is not well-understood in this setting. In this work, we question the use of STE for extreme LLM compression, showing that it can be sub-optimal, and perform a systematic study of quantization-aware fine-tuning strategies for LLMs. We propose PV-Tuning - a representation-agnostic framework that generalizes and improves upon existing fine-tuning strategies, and provides convergence guarantees in restricted cases. On the practical side, when used for 1-2 bit vector quantization, PV-Tuning outperforms prior techniques for highly-performant models such as Llama and Mistral. Using PV-Tuning, we achieve the first Pareto-optimal quantization for Llama-2 family models at 2 bits per parameter.
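For context on the straight-through estimation the abstract refers to: a quantizer's rounding step has zero gradient almost everywhere, so STE "passes the gradient through" the rounding as if it were the identity. The sketch below is a minimal, illustrative scalar example of that idea only; it is not the paper's PV-Tuning algorithm, and the quantizer, loss, and learning rate are arbitrary choices for demonstration.

```python
# Minimal sketch of straight-through estimation (STE) for a scalar
# uniform quantizer. Illustrative only; not the PV-Tuning method.

def quantize(w, step=0.5):
    """Forward pass: round w to the nearest multiple of `step`."""
    return step * round(w / step)

def ste_step(w, target, lr=0.1, step=0.5):
    """One gradient step on L(w) = (quantize(w) - target)^2.
    Rounding has zero gradient almost everywhere, so STE pretends
    d quantize(w)/dw = 1 and copies the gradient straight through."""
    q = quantize(w, step)
    grad_q = 2.0 * (q - target)  # exact dL/dq
    grad_w = grad_q              # STE approximation: dL/dw := dL/dq
    return w - lr * grad_w

w = 0.9
for _ in range(50):
    w = ste_step(w, target=2.0)
# The continuous weight w drifts until its quantized value matches
# the target; the update stalls once the gradient through q vanishes.
```

Once `quantize(w)` equals the target, the STE gradient is zero and `w` stops moving; the paper's point is that this kind of update can behave sub-optimally when fine-tuning extremely compressed models.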

Cite

Text

Malinovskii et al. "PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression." Neural Information Processing Systems, 2024. doi:10.52202/079017-0165

Markdown

[Malinovskii et al. "PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/malinovskii2024neurips-pvtuning/) doi:10.52202/079017-0165

BibTeX

@inproceedings{malinovskii2024neurips-pvtuning,
  title     = {{PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression}},
  author    = {Malinovskii, Vladimir and Mazur, Denis and Ilin, Ivan and Kuznedelev, Denis and Burlachenko, Konstantin and Yi, Kai and Alistarh, Dan and Richtarik, Peter},
  booktitle = {Neural Information Processing Systems},
  year      = {2024},
  doi       = {10.52202/079017-0165},
  url       = {https://mlanthology.org/neurips/2024/malinovskii2024neurips-pvtuning/}
}