Generating Efficient Kernels for Quantized Inference on Large Language Models

Abstract

We present ongoing work on a new automatic code generation approach for supporting quantized generative inference on LLMs such as LLaMA or OPT on off-the-shelf CPUs. Our approach is informed by the target architecture and a performance model, including both hardware characteristics and method-specific accuracy constraints. Results on CPU-based inference for LLaMA models show that our approach can lead to high performance and high accuracy, comparing favorably to the best existing open-source solution.
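To make the setting concrete, quantized inference kernels of the kind the abstract describes dequantize low-bit packed weights on the fly while computing matrix-vector products. The sketch below shows a scalar 4-bit group-quantized dot product in C; the group size, packing layout (two 4-bit values per byte), and the names `q4_dot`, `scales`, `zeros` are illustrative assumptions for this page, not the paper's actual scheme or code.

```c
#include <stdint.h>

#define GROUP 8  /* weights per quantization group (an assumption) */

/* Dequantize packed 4-bit weights (two per byte, low nibble first) and
 * accumulate a dot product with float activations. Each group of GROUP
 * weights shares one scale and one zero offset. */
float q4_dot(const uint8_t *packed, const float *scales,
             const float *zeros, const float *x, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; i++) {
        uint8_t byte = packed[i / 2];
        int q = (i % 2 == 0) ? (byte & 0x0F) : (byte >> 4);
        int g = i / GROUP;
        float w = scales[g] * (float)q - zeros[g]; /* dequantize */
        acc += w * x[i];
    }
    return acc;
}
```

A generator like the one the abstract describes would emit specialized, vectorized variants of such a loop per target CPU and quantization format; the scalar form above only illustrates the computation being specialized.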

Cite

Text

Pegolotti et al. "Generating Efficient Kernels for Quantized Inference on Large Language Models." ICML 2023 Workshops: ES-FoMO, 2023.

Markdown

[Pegolotti et al. "Generating Efficient Kernels for Quantized Inference on Large Language Models." ICML 2023 Workshops: ES-FoMO, 2023.](https://mlanthology.org/icmlw/2023/pegolotti2023icmlw-generating/)

BibTeX

@inproceedings{pegolotti2023icmlw-generating,
  title     = {{Generating Efficient Kernels for Quantized Inference on Large Language Models}},
  author    = {Pegolotti, Tommaso and Frantar, Elias and Alistarh, Dan and Püschel, Markus},
  booktitle = {ICML 2023 Workshops: ES-FoMO},
  year      = {2023},
  url       = {https://mlanthology.org/icmlw/2023/pegolotti2023icmlw-generating/}
}