Tequila: Trapping-Free Ternary Quantization for Large Language Models

Huang, Hong; Wu, Decheng; Cen, Rui; Yu, Guanghua; Li, Zonghang; Liu, Kai; Zhu, Jianchen; Chen, Peng; Liu, Xue; Wu, Dapeng

ICLR 2026

Abstract

Quantization techniques are essential for the deployment of Large Language Models (LLMs) on edge devices. However, prevailing methods often rely on mixed-precision multiplication that lacks efficient hardware support, making them impractical to deploy. Ternary weight quantization addresses this by constraining weights to $\{-1, 0, 1\}$, replacing expensive multiplications with hardware-efficient additions. However, such aggressive compression leads to significant accuracy degradation, even after costly quantization-aware training with massive data. We identify the core issue as _**deadzone trapping**: a large number of weights are trapped at the deadzone boundary._ This occurs because these weights receive only noisy, less informative gradients, preventing stable escape from the deadzone and severely impeding model capacity and optimization. To address this issue, we propose **Tequila**, a trapping-free quantization optimization method that reactivates deadzone-trapped weights by repurposing them as dynamic biases. This allows the repurposed weights to provide a continuous signal in the forward pass and, critically, to receive direct, meaningful gradient signals during backpropagation, thereby enhancing model capacity and optimization with nearly _zero_ inference overhead. Extensive evaluations demonstrate that Tequila outperforms state-of-the-art (SOTA) ternary quantization methods across five benchmarks. Specifically, on the ARC benchmark, it achieves a $>4\%$ accuracy gain over the SOTA baseline, nearly matching full-precision performance (a $<1\%$ gap) with a $3.0\times$ inference speedup. Consequently, Tequila offers a highly practical and efficient implementation for the deployment of advanced LLMs in resource-constrained environments. The code is available at https://github.com/Tencent/AngelSlim/tree/tequila/TernaryQuant .
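
To make the abstract's two background ideas concrete (weights falling into a deadzone, and multiplications becoming additions), here is a minimal sketch of a generic ternary quantizer. It is not the Tequila method and is not taken from the linked repository. The function names `ternarize` and `ternary_dot` are hypothetical, and the threshold rule (0.7 times the mean absolute weight, a heuristic commonly used in earlier ternary-weight work) is an assumption chosen only for illustration.

```python
def ternarize(weights, delta_ratio=0.7):
    """Map real-valued weights to {-1, 0, +1} plus a single scale factor.

    ASSUMPTION: the deadzone half-width is delta_ratio * mean(|w|).
    This is a generic heuristic, not a value from the Tequila paper.
    """
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    delta = delta_ratio * mean_abs
    # Weights with |w| <= delta fall into the deadzone and become 0.
    ternary = [0 if abs(w) <= delta else (1 if w > 0 else -1) for w in weights]
    # Scale: mean magnitude of the weights that survived the deadzone.
    kept = [abs(w) for w, t in zip(weights, ternary) if t != 0]
    scale = sum(kept) / len(kept) if kept else 0.0
    return ternary, scale, delta


def ternary_dot(ternary, scale, x):
    """Dot product that uses only additions/subtractions, then one scaling."""
    acc = 0.0
    for t, xi in zip(ternary, x):
        if t == 1:
            acc += xi
        elif t == -1:
            acc -= xi
        # t == 0: the weight sits in the deadzone and contributes nothing.
    return scale * acc


weights = [0.9, -0.05, 0.4, -0.8, 0.02, 0.1]
ternary, scale, delta = ternarize(weights)  # ternary == [1, 0, 1, -1, 0, 0]
y = ternary_dot(ternary, scale, [1.0, 2.0, 3.0, 1.0, 5.0, 6.0])
```

In this toy example half of the weights land in the deadzone and contribute nothing to the output. The abstract identifies weights trapped at the deadzone boundary as the core problem, and Tequila addresses it by repurposing them as dynamic biases, which this sketch does not attempt to reproduce.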

Cite

Text

Huang et al. "Tequila: Trapping-Free Ternary Quantization for Large Language Models." International Conference on Learning Representations, 2026.

Markdown

[Huang et al. "Tequila: Trapping-Free Ternary Quantization for Large Language Models." International Conference on Learning Representations, 2026.](https://mlanthology.org/iclr/2026/huang2026iclr-tequila/)

BibTeX

@inproceedings{huang2026iclr-tequila,
  title     = {{Tequila: Trapping-Free Ternary Quantization for Large Language Models}},
  author    = {Huang, Hong and Wu, Decheng and Cen, Rui and Yu, Guanghua and Li, Zonghang and Liu, Kai and Zhu, Jianchen and Chen, Peng and Liu, Xue and Wu, Dapeng},
  booktitle = {International Conference on Learning Representations},
  year      = {2026},
  url       = {https://mlanthology.org/iclr/2026/huang2026iclr-tequila/}
}