Towards Next-Level Post-Training Quantization of Hyper-Scale Transformers

Abstract

With the increasing complexity of generative AI models, post-training quantization (PTQ) has emerged as a promising solution for deploying hyper-scale models on edge devices such as mobile phones and TVs. Existing PTQ schemes, however, consume considerable time and resources, which can become a bottleneck in real-world situations where frequent model updates and multiple hyperparameter tunings are required. As a cost-effective alternative, learning-free PTQ schemes have been proposed. However, their performance is somewhat limited because they cannot consider the inter-layer dependency within the attention module, which is a significant feature of Transformers. In this paper, we thus propose a novel PTQ algorithm that balances accuracy and efficiency. The key idea of the proposed algorithm, called aespa, is to perform quantization layer-wise for efficiency while targeting attention-wise reconstruction to account for the cross-layer dependency. Through extensive experiments on various language models and complexity analysis, we demonstrate that aespa is accurate and efficient in quantizing Transformer models. The code will be available at https://github.com/SamsungLabs/aespa.
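To make the key idea concrete, the following is a minimal sketch of what a layer-wise update with an attention-wise reconstruction target might look like; the notation (W_Q, W_K, W_V for the query/key/value projection weights, hatted symbols for their quantized counterparts, X for calibration inputs) is assumed here for illustration and is not taken verbatim from the paper.

\min_{\widehat{W}_Q} \; \mathbb{E}_{X} \left\| \mathrm{Attn}\!\left(X W_Q,\, X W_K,\, X W_V\right) - \mathrm{Attn}\!\left(X \widehat{W}_Q,\, X W_K,\, X W_V\right) \right\|_F^2,
\quad \text{where } \mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\tfrac{Q K^\top}{\sqrt{d}}\right) V.

Analogous objectives would be solved for \widehat{W}_K and \widehat{W}_V. Under this reading, each projection weight is still quantized one layer at a time (keeping the cost of layer-wise PTQ), but the reconstruction error is measured at the attention-module output rather than at the individual projection output, which is how the cross-layer dependency enters the objective.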

Cite

Text

Kim et al. "Towards Next-Level Post-Training Quantization of Hyper-Scale Transformers." Neural Information Processing Systems, 2024. doi:10.52202/079017-2991

Markdown

[Kim et al. "Towards Next-Level Post-Training Quantization of Hyper-Scale Transformers." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/kim2024neurips-nextlevel/) doi:10.52202/079017-2991

BibTeX

@inproceedings{kim2024neurips-nextlevel,
  title     = {{Towards Next-Level Post-Training Quantization of Hyper-Scale Transformers}},
  author    = {Kim, Junhan and Lee, Chungman and Cho, Eulrang and Park, Kyungphil and Kim, Ho-young and Kim, Joonyoung and Jeon, Yongkweon},
  booktitle = {Neural Information Processing Systems},
  year      = {2024},
  doi       = {10.52202/079017-2991},
  url       = {https://mlanthology.org/neurips/2024/kim2024neurips-nextlevel/}
}