NoMAD-Attention: Efficient LLM Inference on CPUs Through Multiply-Add-Free Attention

Abstract

Large Language Model (LLM) inference on Central Processing Units (CPUs) is challenging due to the vast quantities of Multiply-Add (MAD) matrix operations in the attention computations. This paper highlights a rare gem of modern CPUs, Single-Instruction-Multiple-Data (SIMD) registers, which allow for ultra-low-latency lookups in batch. We leverage this unique capability to propose NoMAD-Attention, an efficient attention algorithm that replaces MAD operations with in-register lookups. Through hardware-aware algorithmic designs, NoMAD-Attention computes attention scores using repeated fast accesses to SIMD registers. NoMAD-Attention works with pre-trained attention-based LLMs without model finetuning. Extensive empirical evaluations demonstrate that NoMAD-Attention maintains the quality of the original LLMs well and speeds up a 4-bit-quantized LLaMA-7B-based model by up to $2\times$ at a 16k context length.
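
To make the core idea concrete, the sketch below illustrates (under stated assumptions, not the authors' implementation) how attention-score contributions can be fetched with in-register lookups instead of multiply-add operations: keys are assumed to be pre-quantized into 4-bit codebook indices, a query-dependent 16-entry table of quantized partial dot products is held inside a 128-bit SIMD register, and a single byte-shuffle instruction retrieves the contributions for 16 keys at once. The tile size, int8 quantization, and all names here are illustrative assumptions.

```cpp
// Minimal sketch of lookup-based (multiply-add-free) score estimation.
// Assumptions (not from the paper): keys are product-quantized to 4-bit
// codes per sub-quantizer, and partial dot products are quantized to int8.
#include <immintrin.h>
#include <cstdint>
#include <cstdio>

// Fetch partial scores for 16 keys under one sub-quantizer.
// lut:   16 int8 entries, entry c = quantized <query sub-vector, centroid c>
// codes: 16 key codes packed one per byte (values 0..15)
static inline __m128i lookup16(__m128i lut, __m128i codes) {
    const __m128i low_nibble_mask = _mm_set1_epi8(0x0F);
    __m128i idx = _mm_and_si128(codes, low_nibble_mask); // keep 4-bit codes
    return _mm_shuffle_epi8(lut, idx);                   // 16 in-register lookups
}

int main() {
    // Hypothetical query-dependent lookup table for this sub-quantizer.
    alignas(16) int8_t lut_bytes[16] = {
        -8, -6, -4, -2, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12};
    // Codes of 16 cached keys under this sub-quantizer.
    alignas(16) uint8_t key_codes[16] = {
        0, 3, 7, 15, 2, 9, 4, 11, 1, 14, 6, 8, 5, 10, 12, 13};

    __m128i lut   = _mm_load_si128(reinterpret_cast<const __m128i*>(lut_bytes));
    __m128i codes = _mm_load_si128(reinterpret_cast<const __m128i*>(key_codes));
    __m128i partial = lookup16(lut, codes); // no multiply-add executed

    alignas(16) int8_t out[16];
    _mm_store_si128(reinterpret_cast<__m128i*>(out), partial);
    for (int i = 0; i < 16; ++i)
        printf("key %2d: partial score %d\n", i, out[i]);
    // A full attention score would accumulate such partials over all
    // sub-quantizers and dequantize before the softmax.
    return 0;
}
```

Compiled with SSSE3 support (e.g. `g++ -mssse3`), the shuffle above replaces what would otherwise be 16 separate multiply-accumulate dot-product steps with a single register-resident table lookup, which is the kind of substitution the abstract describes.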

Cite

Text

Zhang et al. "NoMAD-Attention: Efficient LLM Inference on CPUs Through Multiply-Add-Free Attention." Neural Information Processing Systems, 2024. doi:10.52202/079017-3581

Markdown

[Zhang et al. "NoMAD-Attention: Efficient LLM Inference on CPUs Through Multiply-Add-Free Attention." Neural Information Processing Systems, 2024.](https://mlanthology.org/neurips/2024/zhang2024neurips-nomadattention/) doi:10.52202/079017-3581

BibTeX

@inproceedings{zhang2024neurips-nomadattention,
  title     = {{NoMAD-Attention: Efficient LLM Inference on CPUs Through Multiply-Add-Free Attention}},
  author    = {Zhang, Tianyi and Yi, Jonah and Yao, Bowen and Xu, Zhaozhuo and Shrivastava, Anshumali},
  booktitle = {Neural Information Processing Systems},
  year      = {2024},
  doi       = {10.52202/079017-3581},
  url       = {https://mlanthology.org/neurips/2024/zhang2024neurips-nomadattention/}
}