QuEST: Training Accurate LLMs over Highly-Compressed Weights and Activations

Abstract

One main approach to reducing the massive costs of large language models (LLMs) is the use of quantized or sparse representations for training or deployment. While post-training compression methods are very popular, obtaining even more accurate compressed models by directly training over such representations, i.e., Quantization-Aware Training (QAT), remains a largely open problem. In this paper, we advance the state of the art for QAT via a new method called QuEST, which is Pareto-competitive with FP16: it provides better accuracy at lower model size while training models with weights and activations in 4 bits or less. Moreover, QuEST allows stable training with 1-bit weights and activations, and is compatible with weight sparsity. Experiments on Llama-type architectures show that QuEST induces new, stable scaling laws across the entire range of hardware-supported compressed representations. Finally, we provide GPU kernel support showing that the models produced by QuEST can be executed efficiently on current hardware.
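For readers unfamiliar with the QAT setting mentioned above, the sketch below illustrates the general idea of training over quantized representations: weights and activations are "fake-quantized" in the forward pass while gradients flow through a straight-through estimator. This is a minimal, generic QAT example for illustration only; the quantizer, bit allocation, and gradient estimator here are assumptions and do not reproduce QuEST's actual method.

```python
import torch

def fake_quantize(x: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Symmetric per-tensor fake quantization with a straight-through estimator.

    Forward: round to a `bits`-bit grid. Backward: pass gradients through
    unchanged (classic STE). Generic QAT building block, not QuEST's quantizer.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    q = (x / scale).round().clamp(-qmax - 1, qmax) * scale
    # x + (q - x).detach(): forward value is q, gradient is identity w.r.t. x.
    return x + (q - x).detach()

class QuantLinear(torch.nn.Linear):
    """Linear layer that fake-quantizes both its weights and its inputs."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.nn.functional.linear(
            fake_quantize(x, bits=4),
            fake_quantize(self.weight, bits=4),
            self.bias,
        )
```

Training with such layers lets the optimizer adapt to quantization error directly, which is the basic motivation for QAT over post-training compression.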

Cite

Text

Panferov et al. "QuEST: Training Accurate LLMs over Highly-Compressed Weights and Activations." ICLR 2025 Workshops: SLLM, 2025.

Markdown

[Panferov et al. "QuEST: Training Accurate LLMs over Highly-Compressed Weights and Activations." ICLR 2025 Workshops: SLLM, 2025.](https://mlanthology.org/iclrw/2025/panferov2025iclrw-quest/)

BibTeX

@inproceedings{panferov2025iclrw-quest,
  title     = {{QuEST: Training Accurate LLMs over Highly-Compressed Weights and Activations}},
  author    = {Panferov, Andrei and Chen, Jiale and Tabesh, Soroush and Castro, Roberto L. and Nikdan, Mahdi and Alistarh, Dan},
  booktitle = {ICLR 2025 Workshops: SLLM},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/panferov2025iclrw-quest/}
}