Direct Quantized Training of Language Models with Stochastic Rounding
Abstract
Although recent quantized Large Language Models, such as BitNet, have paved the way for significant reduction in memory usage during deployment with binary or ternary weights, training these models still demands substantial memory footprints. This is partly because high-precision (i.e., unquantized) weights required for straight-through estimation must be maintained throughout the whole training process. To address this, we explore directly updating the quantized low-precision weights without relying on straight-through estimation during backpropagation, aiming to save memory usage during training. Specifically, we employ a stochastic rounding technique to minimize the information loss caused by the use of low-bit weights throughout training. Experimental results on our LLaMA-structured models of various sizes indicate that (1) training with only low-precision weights is feasible even when they are constrained to ternary values; (2) extending the bit width to 8 bits achieves performance on par with BitNet b1.58; (3) our models remain robust to precision scaling and memory reduction, showing minimal performance degradation when moving from FP32 to lower-memory environments (BF16/FP8); and (4) our models also support inference using ternary weights, showcasing their flexibility in deployment.
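The abstract names stochastic rounding but does not spell out the update rule. As a minimal sketch, assuming the textbook definition (round up with probability equal to the fractional part, so the result is unbiased in expectation), a direct update of ternary weights might look like the following. The function names, the unit-spaced grid {-1, 0, 1}, and the plain SGD step are illustrative assumptions, not the authors' implementation; any weight scaling or optimizer state used in the paper is omitted.

```python
import numpy as np


def stochastic_round(x, rng):
    """Round each element to a neighboring integer.

    An element is rounded up with probability equal to its fractional
    part, so the expected value of the result equals x.
    """
    lower = np.floor(x)
    frac = x - lower
    return lower + (rng.random(x.shape) < frac)


def sr_update_ternary(w_q, grad, lr, rng):
    """One illustrative SGD step that keeps weights in {-1, 0, 1}.

    Hypothetical helper: no high-precision copy of the weights is kept
    between steps; the real-valued result exists only inside this call.
    """
    w_real = w_q - lr * grad               # transient real-valued update
    w_real = np.clip(w_real, -1.0, 1.0)    # stay on the ternary range
    return stochastic_round(w_real, rng).astype(np.int8)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = np.array([-1, 0, 1, 0], dtype=np.int8)
    g = np.array([0.4, -0.3, 0.2, 0.0])
    print(sr_update_ternary(w, g, lr=1.0, rng=rng))
```

With deterministic round-to-nearest, an update smaller than half a grid step would always be discarded; under stochastic rounding it changes the weight with a probability proportional to its size, which is the sense in which information loss is reduced.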
Cite
Text
Zhao et al. "Direct Quantized Training of Language Models with Stochastic Rounding." Proceedings of the 17th Asian Conference on Machine Learning, 2025.
Markdown
[Zhao et al. "Direct Quantized Training of Language Models with Stochastic Rounding." Proceedings of the 17th Asian Conference on Machine Learning, 2025.](https://mlanthology.org/acml/2025/zhao2025acml-direct/)
BibTeX
@inproceedings{zhao2025acml-direct,
title = {{Direct Quantized Training of Language Models with Stochastic Rounding}},
author = {Zhao, Kaiyan and Tabaru, Tsuguchika and Kobayashi, Kenichi and Honda, Takumi and Yamazaki, Masafumi and Tsuruoka, Yoshimasa},
booktitle = {Proceedings of the 17th Asian Conference on Machine Learning},
year = {2025},
pages = {1150-1165},
volume = {304},
url = {https://mlanthology.org/acml/2025/zhao2025acml-direct/}
}