Scaling Laws and Efficient Inference for Ternary Language Models

Abstract

Large language models (LLMs) are increasingly deployed across research and industry applications, yet their high inference cost remains a major challenge. In this work, we investigate ternary language models (TriLMs), which employ quantization-aware training to significantly reduce memory requirements, as a potential solution. We present three key contributions: (1) a comprehensive scaling law analysis showing that TriLMs benefit more from scaling training data than their floating-point counterparts; (2) the introduction of Spectra-1.1, an open-source family of state-of-the-art TriLMs trained on up to 1.2 trillion tokens, demonstrating performance competitive with Llama-1 7B; and (3) ternary kernels for efficient inference, utilizing novel 1.6-bit and 2-bit packing schemes. Notably, our GPU kernel using 2-bit packing, called TriRun, achieves up to an 8$\times$ speedup over float16 baselines, enabling efficient inference in memory-constrained environments. We will release the Spectra-1.1 models along with the optimized inference kernels to encourage further research on TriLMs.
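
The abstract refers to 1.6-bit and 2-bit packing of ternary weights. The paper's actual kernel layouts are not given here, so the following is a minimal sketch of how those storage budgets can be met for weights in {-1, 0, +1}: four 2-bit codes per byte for the 2-bit scheme, and a base-3 encoding of five weights per byte (3^5 = 243 ≤ 256, i.e. 8/5 = 1.6 bits per weight) for the 1.6-bit scheme. Function names and layouts are illustrative assumptions, not the TriRun implementation.

```python
# Illustrative sketch only: these packing layouts are assumptions chosen to
# show how the 2-bit and 1.6-bit per-weight budgets can be met for ternary
# weights; they are not the paper's kernel formats.
import numpy as np

def pack_2bit(ternary: np.ndarray) -> np.ndarray:
    """Map {-1, 0, +1} -> {0, 1, 2} and store four 2-bit codes per byte."""
    codes = (ternary + 1).astype(np.uint8)            # values 0, 1, 2
    codes = codes.reshape(-1, 4)                       # assumes len % 4 == 0
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)    # bit offsets within a byte
    return np.bitwise_or.reduce(codes << shifts, axis=1).astype(np.uint8)

def pack_1p6bit(ternary: np.ndarray) -> np.ndarray:
    """Base-3 encode five ternary digits per byte: 8 bits / 5 weights = 1.6 bits/weight."""
    codes = (ternary + 1).astype(np.uint8).reshape(-1, 5)   # assumes len % 5 == 0
    powers = np.array([1, 3, 9, 27, 81], dtype=np.uint8)    # base-3 place values
    return (codes * powers).sum(axis=1).astype(np.uint8)    # max value 242 < 256

if __name__ == "__main__":
    w = np.random.choice([-1, 0, 1], size=20).astype(np.int8)
    print(pack_2bit(w))     # 5 bytes for 20 weights (2.0 bits/weight)
    print(pack_1p6bit(w))   # 4 bytes for 20 weights (1.6 bits/weight)
```

A real kernel would additionally arrange the packed bytes to match the GPU's memory-access and decode pattern; the sketch above only illustrates the arithmetic behind the two bit budgets.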

Cite

Text

Vaidhya et al. "Scaling Laws and Efficient Inference for Ternary Language Models." ICLR 2025 Workshops: SLLM, 2025.

Markdown

[Vaidhya et al. "Scaling Laws and Efficient Inference for Ternary Language Models." ICLR 2025 Workshops: SLLM, 2025.](https://mlanthology.org/iclrw/2025/vaidhya2025iclrw-scaling/)

BibTeX

@inproceedings{vaidhya2025iclrw-scaling,
  title     = {{Scaling Laws and Efficient Inference for Ternary Language Models}},
  author    = {Vaidhya, Tejas and Kaushal, Ayush and Jain, Vineet and Couture-Harpin, Francis and Shishodia, Prashant and Behbahani, Majid and Rish, Irina and Nevmyvaka, Yuriy},
  booktitle = {ICLR 2025 Workshops: SLLM},
  year      = {2025},
  url       = {https://mlanthology.org/iclrw/2025/vaidhya2025iclrw-scaling/}
}