LittleBit: Ultra Low-Bit Quantization via Latent Factorization
Abstract
The deployment of large language models (LLMs) is frequently hindered by prohibitive memory and computational requirements. While quantization mitigates these bottlenecks, maintaining model fidelity in the sub-1-bit regime remains a persistent challenge. In this paper, we introduce LittleBit, a novel framework for extreme LLM compression. We target quantization rates as low as $0.1$ bits per weight (BPW), achieving a memory reduction of approximately $31\times$, which effectively compresses Llama2-13B to under $0.9$ GB. We represent weights via low-rank latent matrix factorization and subsequently binarize the resulting factors. To counteract the information loss inherent to such drastic precision reduction, we integrate a multi-scale compensation mechanism that learns importance parameters across row, column, and latent dimensions. Two primary contributions enable effective training: Dual Sign-Value-Independent Decomposition (Dual-SVID) for quantization-aware training (QAT) initialization, and Residual Compensation to minimize approximation errors. Extensive experiments confirm the superiority of LittleBit in the sub-1-bit domain; for instance, our method at $0.1$ BPW surpasses the performance of leading techniques operating at $0.7$ BPW on Llama2-7B. We establish a new size-performance trade-off---unlocking a potential $11.6\times$ inference speedup relative to FP16---and render powerful LLMs practical for resource-constrained environments. Our code is available at https://github.com/SamsungLabs/LittleBit.
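The abstract describes weights reconstructed from binarized low-rank factors with learned importance scales along the row, column, and latent dimensions. Below is a minimal PyTorch-style sketch of that idea; the class name, parameter names, and the exact placement of the scales are illustrative assumptions rather than the released implementation (the Dual-SVID initialization and the Residual Compensation factors are omitted; see the linked repository for the official code).

```python
# A minimal sketch (assumption, not the official implementation): one
# LittleBit-style linear layer reconstructing W from binarized rank-r factors
# plus learned row/column/latent importance scales, per the abstract.
import torch
import torch.nn as nn

class LittleBitLinearSketch(nn.Module):
    def __init__(self, in_features: int, out_features: int, rank: int):
        super().__init__()
        # Binary latent factors stored as +/-1 (signs of underlying latents);
        # random init here stands in for the paper's Dual-SVID initialization.
        self.b_u = nn.Parameter(torch.randn(out_features, rank).sign(), requires_grad=False)
        self.b_v = nn.Parameter(torch.randn(in_features, rank).sign(), requires_grad=False)
        # Learned importance scales across row, column, and latent dimensions.
        self.row_scale = nn.Parameter(torch.ones(out_features))
        self.col_scale = nn.Parameter(torch.ones(in_features))
        self.latent_scale = nn.Parameter(torch.ones(rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Equivalent to x @ W^T with
        #   W = diag(row_scale) @ B_U @ diag(latent_scale) @ B_V^T @ diag(col_scale),
        # applied factor-by-factor so the dense W is never materialized.
        h = (x * self.col_scale) @ self.b_v        # project into the rank-r latent space
        h = h * self.latent_scale                  # per-latent importance
        return (h @ self.b_u.t()) * self.row_scale # back to output space, per-row scale
```

Under this factorization, the two sign matrices for an $m \times n$ layer cost $(m+n) \cdot r$ bits, so the target BPW fixes the rank: ignoring the small overhead of the floating-point scales, $0.1$ BPW on a $4096 \times 4096$ layer corresponds to $r \approx 0.1 \cdot 4096^2 / (2 \cdot 4096) \approx 205$.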
Cite

Text:
Lee et al. "LittleBit: Ultra Low-Bit Quantization via Latent Factorization." Advances in Neural Information Processing Systems, 2025. https://mlanthology.org/neurips/2025/lee2025neurips-littlebit/

BibTeX:
@inproceedings{lee2025neurips-littlebit,
title = {{LittleBit: Ultra Low-Bit Quantization via Latent Factorization}},
author = {Lee, Banseok and Kim, Dongkyu and You, Youngcheon and Kim, Young-Min},
booktitle = {Advances in Neural Information Processing Systems},
year = {2025},
url = {https://mlanthology.org/neurips/2025/lee2025neurips-littlebit/}
}