Generating Efficient Kernels for Quantized Inference on Large Language Models
Abstract
We present ongoing work on a new automatic code generation approach that supports quantized generative inference on LLMs such as LLaMA or OPT on off-the-shelf CPUs. Our approach is informed by the target architecture and a performance model, including both hardware characteristics and method-specific accuracy constraints. Results on CPU-based inference for LLaMA models show that our approach can lead to high performance and high accuracy, comparing favorably to the best existing open-source solution.
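The paper itself does not include code on this page; purely as an illustration of the kind of kernel being generated, below is a minimal C sketch of the hot loop of quantized generative inference: a matrix-vector product over 4-bit group-quantized weights with one fp32 scale per group. The layout (two weights packed per byte, symmetric offset of 8, row-major packing) and all names are illustrative assumptions, not the authors' generated code.

#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch (not the authors' generated kernel): y = W x for a
 * weight matrix stored as 4-bit integers with one fp32 scale per group of
 * `group_size` columns. Two weights are packed per byte, low nibble first;
 * values are assumed symmetric around zero (offset 8), as in common 4-bit
 * LLM quantization schemes. Assumes cols is divisible by group_size and
 * group_size is even. */
void matvec_q4(const uint8_t *w_packed,  /* rows * cols / 2 packed weights */
               const float   *scales,    /* rows * (cols / group_size)     */
               const float   *x,         /* input vector, length cols      */
               float         *y,         /* output vector, length rows     */
               size_t rows, size_t cols, size_t group_size)
{
    size_t groups_per_row = cols / group_size;
    for (size_t r = 0; r < rows; r++) {
        float acc = 0.0f;
        for (size_t g = 0; g < groups_per_row; g++) {
            /* Accumulate the group with integer weights, then apply the
             * group's scale once: one multiply per group, not per weight. */
            float group_acc = 0.0f;
            for (size_t c = 0; c < group_size; c += 2) {
                size_t col   = g * group_size + c;
                uint8_t byte = w_packed[(r * cols + col) / 2];
                int w0 = (int)(byte & 0x0F) - 8;  /* low nibble  */
                int w1 = (int)(byte >> 4)   - 8;  /* high nibble */
                group_acc += (float)w0 * x[col] + (float)w1 * x[col + 1];
            }
            acc += group_acc * scales[r * groups_per_row + g];
        }
        y[r] = acc;
    }
}

In practice, a generated kernel would replace the scalar inner loop with SIMD intrinsics specialized to the target CPU's vector width and the chosen quantization format; automating that specialization against a performance model is what the approach described above targets.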
Cite
Text
Pegolotti et al. "Generating Efficient Kernels for Quantized Inference on Large Language Models." ICML 2023 Workshops: ES-FoMO, 2023.

Markdown
[Pegolotti et al. "Generating Efficient Kernels for Quantized Inference on Large Language Models." ICML 2023 Workshops: ES-FoMO, 2023.](https://mlanthology.org/icmlw/2023/pegolotti2023icmlw-generating/)

BibTeX
@inproceedings{pegolotti2023icmlw-generating,
  title     = {{Generating Efficient Kernels for Quantized Inference on Large Language Models}},
  author    = {Pegolotti, Tommaso and Frantar, Elias and Alistarh, Dan and Püschel, Markus},
  booktitle = {ICML 2023 Workshops: ES-FoMO},
  year      = {2023},
  url       = {https://mlanthology.org/icmlw/2023/pegolotti2023icmlw-generating/}
}